## here() starts at /Users/emmajuansalazar/arxius/projects/ge-analysis-bronchial-tissue
Lung cancer is the leading cause of cancer-related death in the United States. The molecular events preceding the onset of disease are poorly understood, and no effective tools exist to identify smokers with premalignant lesions (PMLs) that will progress to invasive cancer. Previous research has identified molecular changes in the smoke-exposed airways. However, the focus of the paper “Detecting the Presence and Progression of Premalignant Lung Lesions via Airway Gene Expression” Beane et al. (2017) shifts the focus to the first stages of the disease progress. They utilize samples from the same airway field of injury to study PMLs and their potential in preventing lung cancer.
The researchers’ goal is to profile samples extracted from normal appearing bronchial mucosa by mRNA-Seq, and attempt to find differentially expressed gene sets and pathways between subjects with PMLs and without PMLs. From the original 82 samples, 50 subjects had PML, 25 didn’t, and 7 were excluded in the end.
The resulting raw RNA-seq data has been deposited at the Gene Expression Omnibus (GEO), where they are publicly available under accession GSE79209.
In this analysis, we attempted to formulate a new question to answer, using the dataset of the mentioned paper.
We’ll perform an Exploratory Data Analysis of the corresponding RNA-seq gene expression profiles generated as a table of counts using the DEE2 (https://dee2.io) pipeline by Ziemann, Kaspi, and El-Osta (2019), and further packaged into a SummarizedExperiment object with genes mapped to Entrez identifiers. This object also stores the phenotypic information about the profiled samples that has been also made available at GEO.
The EDA will allow us to look further into the dataset and have enough information for a new Experimental Design. Hierarchical clustering and batch identification will assist us in choosing a new way to group the 82 samples, with the goal of finding differentially expressed genes. We will perform functional enrichment analysis to analyse high-throughput experimental results in order to discover biological annotations that are over-represented in the list of DE genes.
We start by importing the raw table of counts.
library(SummarizedExperiment)
se <- readRDS("data/GSE79209.rds")
We have 25122 genes by 81 samples. From the first row and column names shown by the object, we can figure out that genes are defined by Entrez (Maglott et al. 2010) identifiers and samples by Sequence Read Archive Run (SRR) identifiers.
The row data in this object contains information about the profiled genes.
head(rowData(se))
DataFrame with 6 rows and 5 columns
gene_id gene_biotype description
<character> <character> <character>
1 ENSG00000121410 protein_coding alpha-1-B glycoprote..
10 ENSG00000156006 protein_coding N-acetyltransferase ..
100 ENSG00000196839 protein_coding adenosine deaminase ..
1000 ENSG00000170558 protein_coding cadherin 2 [Source:H..
10000 ENSG00000117020 protein_coding AKT serine/threonine..
100008586 ENSG00000236362 protein_coding G antigen 12F [Sourc..
gene_id_version symbol
<character> <character>
1 ENSG00000121410.11 A1BG
10 ENSG00000156006.4 NAT2
100 ENSG00000196839.12 ADA
1000 ENSG00000170558.8 CDH2
10000 ENSG00000117020.16 AKT3
100008586 ENSG00000236362.8 GAGE12F
Among this information, the gene symbol and description are potentially useful for interpreting results of, for instance, a differential expression analysis.
Let’s explore the information we have for the samples: phenotypical and technical data.
Conducting exploratory data analysis (EDA) will enable us to identify the number of phenotypical variables, the technical variables, and potential areas of interest for future experimental design.
dim(colData(se))
[1] 81 68
head(colData(se), n=3)
DataFrame with 3 rows and 68 columns
title geo_accession status
<character> <character> <character>
SRR3225255 Subject L1, dysplasia GSM2088002 Public on May 25 2017
SRR3225256 Subject L10, dysplasia GSM2088003 Public on May 25 2017
SRR3225257 Subject L11, dysplasia GSM2088004 Public on May 25 2017
submission_date last_update_date type channel_count
<character> <character> <character> <character>
SRR3225255 Mar 14 2016 May 15 2019 SRA 1
SRR3225256 Mar 14 2016 May 15 2019 SRA 1
SRR3225257 Mar 14 2016 May 15 2019 SRA 1
source_name_ch1 organism_ch1 characteristics_ch1
<character> <character> <character>
SRR3225255 Bronchial brushing Homo sapiens tissue: Bronchial br..
SRR3225256 Bronchial brushing Homo sapiens tissue: Bronchial br..
SRR3225257 Bronchial brushing Homo sapiens tissue: Bronchial br..
characteristics_ch1.1 characteristics_ch1.2 characteristics_ch1.3
<character> <character> <character>
SRR3225255 age: 69 Sex: Male smoking status: Form..
SRR3225256 age: 50 Sex: Female smoking status: Curr..
SRR3225257 age: 49 Sex: Male smoking status: Curr..
characteristics_ch1.4 characteristics_ch1.5 characteristics_ch1.6
<character> <character> <character>
SRR3225255 pack years: 58 copd status: No COPD max histology: Mild ..
SRR3225256 pack years: 34 copd status: No COPD max histology: Mild ..
SRR3225257 pack years: 35 copd status: No COPD max histology: Moder..
characteristics_ch1.7 characteristics_ch1.8 characteristics_ch1.9
<character> <character> <character>
SRR3225255 dysplasia status: Dy.. total alignments: 91.. unique alignments: 8..
SRR3225256 dysplasia status: Dy.. total alignments: 96.. unique alignments: 8..
SRR3225257 dysplasia status: Dy.. total alignments: 12.. unique alignments: 1..
characteristics_ch1.10 characteristics_ch1.11 characteristics_ch1.12
<character> <character> <character>
SRR3225255 read 1 alignment: 42.. read 2 alignment: 41.. positive strand alig..
SRR3225256 read 1 alignment: 45.. read 2 alignment: 43.. positive strand alig..
SRR3225257 read 1 alignment: 56.. read 2 alignment: 54.. positive strand alig..
characteristics_ch1.13 characteristics_ch1.14 characteristics_ch1.15
<character> <character> <character>
SRR3225255 negative strand alig.. non-splice alignment.. splice alignments: 2..
SRR3225256 negative strand alig.. non-splice alignment.. splice alignments: 2..
SRR3225257 negative strand alig.. non-splice alignment.. splice alignments: 3..
characteristics_ch1.16 characteristics_ch1.17 characteristics_ch1.18
<character> <character> <character>
SRR3225255 properly paired alig.. genebody 80/20 ratio.. mean gc content: 53.54
SRR3225256 properly paired alig.. genebody 80/20 ratio.. mean gc content: 45.45
SRR3225257 properly paired alig.. genebody 80/20 ratio.. mean gc content: 44.44
molecule_ch1 extract_protocol_ch1 extract_protocol_ch1.1
<character> <character> <character>
SRR3225255 polyA RNA Bronchial airway bru.. Sequencing libraries..
SRR3225256 polyA RNA Bronchial airway bru.. Sequencing libraries..
SRR3225257 polyA RNA Bronchial airway bru.. Sequencing libraries..
taxid_ch1 description data_processing
<character> <character> <character>
SRR3225255 9606 Sample_L1 Demultiplexing and c..
SRR3225256 9606 Sample_L10 Demultiplexing and c..
SRR3225257 9606 Sample_L11 Demultiplexing and c..
data_processing.1 data_processing.2 data_processing.3
<character> <character> <character>
SRR3225255 The FASTQ files were.. Alignment and qualit.. Gene count estimates..
SRR3225256 The FASTQ files were.. Alignment and qualit.. Gene count estimates..
SRR3225257 The FASTQ files were.. Alignment and qualit.. Gene count estimates..
data_processing.4 data_processing.5 data_processing.6
<character> <character> <character>
SRR3225255 Gene filtering was c.. Genome_build: hg19 Supplementary_files_..
SRR3225256 Gene filtering was c.. Genome_build: hg19 Supplementary_files_..
SRR3225257 Gene filtering was c.. Genome_build: hg19 Supplementary_files_..
platform_id data_row_count instrument_model library_selection
<character> <character> <character> <character>
SRR3225255 GPL16791 0 Illumina HiSeq 2500 cDNA
SRR3225256 GPL16791 0 Illumina HiSeq 2500 cDNA
SRR3225257 GPL16791 0 Illumina HiSeq 2500 cDNA
library_source library_strategy relation
<character> <character> <character>
SRR3225255 transcriptomic RNA-Seq SRA: https://www.ncb..
SRR3225256 transcriptomic RNA-Seq SRA: https://www.ncb..
SRR3225257 transcriptomic RNA-Seq SRA: https://www.ncb..
relation.1 supplementary_file_1 age:ch1
<character> <character> <character>
SRR3225255 BioSample: https://w.. NONE 69
SRR3225256 BioSample: https://w.. NONE 50
SRR3225257 BioSample: https://w.. NONE 49
copd status:ch1 dysplasia status:ch1 genebody 80/20 ratio:ch1
<character> <character> <character>
SRR3225255 No COPD Dysplasia 1.34213035203572
SRR3225256 No COPD Dysplasia 1.25508564705863
SRR3225257 No COPD Dysplasia 1.20885950984169
max histology:ch1 mean gc content:ch1
<character> <character>
SRR3225255 Mild dysplasia 53.54
SRR3225256 Mild dysplasia 45.45
SRR3225257 Moderate dysplasia 44.44
negative strand alignments:ch1 non-splice alignments:ch1
<character> <character>
SRR3225255 41653617 62567474
SRR3225256 44222609 64646471
SRR3225257 55046872 79307301
pack years:ch1 positive strand alignments:ch1
<character> <character>
SRR3225255 58 41837114
SRR3225256 34 44366147
SRR3225257 35 55244102
properly paired alignments:ch1 read 1 alignment:ch1
<character> <character>
SRR3225255 66570222 42319429
SRR3225256 70579486 45257546
SRR3225257 87409558 56124363
read 2 alignment:ch1 Sex:ch1 smoking status:ch1
<character> <character> <character>
SRR3225255 41171302 Male Former smoker
SRR3225256 43331210 Female Current smoker
SRR3225257 54166611 Male Current smoker
splice alignments:ch1 tissue:ch1 total alignments:ch1
<character> <character> <character>
SRR3225255 20923257 Bronchial brushing 91000281
SRR3225256 23942285 Bronchial brushing 96810722
SRR3225257 30983673 Bronchial brushing 120861113
unique alignments:ch1
<character>
SRR3225255 83490731
SRR3225256 88588756
SRR3225257 110290974
We have a total of 68 phenotypical variables.
To get a sense of what we have, we can look at the column names:
colnames(colData(se))
[1] "title" "geo_accession"
[3] "status" "submission_date"
[5] "last_update_date" "type"
[7] "channel_count" "source_name_ch1"
[9] "organism_ch1" "characteristics_ch1"
[11] "characteristics_ch1.1" "characteristics_ch1.2"
[13] "characteristics_ch1.3" "characteristics_ch1.4"
[15] "characteristics_ch1.5" "characteristics_ch1.6"
[17] "characteristics_ch1.7" "characteristics_ch1.8"
[19] "characteristics_ch1.9" "characteristics_ch1.10"
[21] "characteristics_ch1.11" "characteristics_ch1.12"
[23] "characteristics_ch1.13" "characteristics_ch1.14"
[25] "characteristics_ch1.15" "characteristics_ch1.16"
[27] "characteristics_ch1.17" "characteristics_ch1.18"
[29] "molecule_ch1" "extract_protocol_ch1"
[31] "extract_protocol_ch1.1" "taxid_ch1"
[33] "description" "data_processing"
[35] "data_processing.1" "data_processing.2"
[37] "data_processing.3" "data_processing.4"
[39] "data_processing.5" "data_processing.6"
[41] "platform_id" "data_row_count"
[43] "instrument_model" "library_selection"
[45] "library_source" "library_strategy"
[47] "relation" "relation.1"
[49] "supplementary_file_1" "age:ch1"
[51] "copd status:ch1" "dysplasia status:ch1"
[53] "genebody 80/20 ratio:ch1" "max histology:ch1"
[55] "mean gc content:ch1" "negative strand alignments:ch1"
[57] "non-splice alignments:ch1" "pack years:ch1"
[59] "positive strand alignments:ch1" "properly paired alignments:ch1"
[61] "read 1 alignment:ch1" "read 2 alignment:ch1"
[63] "Sex:ch1" "smoking status:ch1"
[65] "splice alignments:ch1" "tissue:ch1"
[67] "total alignments:ch1" "unique alignments:ch1"
The second column geo_accession contains GEO Sample Accession Number
(GSM) identifers.
GSM identifiers define individual samples, understood in our context as
individual sources of RNA. We can check if we have any technical
replicates as follows:
length(unique(se$geo_accession))
[1] 81
table(lengths(split(colnames(se), se$geo_accession)))
1
81
So, we have 81 different individual samples which matches the number of samples that we have - 81. Therefore we don’t have any technical replicates and we can proceed with the next step.
To proceed further exploring the dataset, we are going to use the
edgeR package and build a
DGEList object, incorporating the gene metadata, which includes the
gene symbol.
library(edgeR)
dge <- DGEList(counts=assays(se)$counts, genes=rowData(se))
head(dge$samples, 3)
group lib.size norm.factors
SRR3225255 1 35504068 1
SRR3225256 1 39395250 1
SRR3225257 1 49591671 1
dim(dge)
[1] 25122 81
Calculate \(\log_2\) CPM units of expression and put them as an additional assay element to ease their manipulation. This provides a standarized and normalized representation of gene expression data.
assays(se)$logCPM <- cpm(dge, log=TRUE)
assays(se)$logCPM[1:5, 1:5]
SRR3225255 SRR3225256 SRR3225257 SRR3225258 SRR3225259
1 -2.8144890 -2.9017808 -4.1164076 -2.4911549 -3.436999
10 -1.8212210 -2.6511767 -3.6837191 -2.1459416 -1.077656
100 -0.0186939 0.9340788 0.2471703 -0.6559511 1.072922
1000 1.4206565 2.2413119 3.0040881 0.5618081 2.296503
10000 1.3426936 2.2490363 2.1152217 2.2466499 1.900561
Let’s explore now some of the phenotypical variables.
To identify the columns that will be useful, we need to examine the number of unique values (levels) of each variable to determine their suitability for grouping samples. Variables with only one unique value or more than 80 unique values are not of interest. Therefore, we aim to create a list of columns with an appropriate number of unique values.
interesting_se <- data.frame(row.names=se$title)
technical_se <- data.frame(row.names=se$title)
for(column in colnames(colData(se))){
se[[column]] <- factor(se[[column]])
if(length(levels(se[[column]]))>1 && length(levels(se[[column]]))<81){
interesting_se[[column]] <- se[[column]]
} else{
technical_se[[column]] <- se[[column]]
}
}
Once we obtain the variables, we can now look at their levels to see what type of data they offer
head(interesting_se)
characteristics_ch1.1 characteristics_ch1.2
Subject L1, dysplasia age: 69 Sex: Male
Subject L10, dysplasia age: 50 Sex: Female
Subject L11, dysplasia age: 49 Sex: Male
Subject L12, dysplasia age: 57 Sex: Male
Subject L13, dysplasia age: 64 Sex: Female
Subject L14, dysplasia age: 69 Sex: Male
characteristics_ch1.3 characteristics_ch1.4
Subject L1, dysplasia smoking status: Former smoker pack years: 58
Subject L10, dysplasia smoking status: Current smoker pack years: 34
Subject L11, dysplasia smoking status: Current smoker pack years: 35
Subject L12, dysplasia smoking status: Current smoker pack years: 40
Subject L13, dysplasia smoking status: Former smoker pack years: 37.58
Subject L14, dysplasia smoking status: Current smoker pack years: 65
characteristics_ch1.5 characteristics_ch1.6
Subject L1, dysplasia copd status: No COPD max histology: Mild dysplasia
Subject L10, dysplasia copd status: No COPD max histology: Mild dysplasia
Subject L11, dysplasia copd status: No COPD max histology: Moderate dysplasia
Subject L12, dysplasia copd status: No COPD max histology: Mild dysplasia
Subject L13, dysplasia copd status: No COPD max histology: Mild dysplasia
Subject L14, dysplasia copd status: COPD max histology: Moderate dysplasia
characteristics_ch1.7 characteristics_ch1.18
Subject L1, dysplasia dysplasia status: Dysplasia mean gc content: 53.54
Subject L10, dysplasia dysplasia status: Dysplasia mean gc content: 45.45
Subject L11, dysplasia dysplasia status: Dysplasia mean gc content: 44.44
Subject L12, dysplasia dysplasia status: Dysplasia mean gc content: 45.45
Subject L13, dysplasia dysplasia status: Dysplasia mean gc content: 47.47
Subject L14, dysplasia dysplasia status: Dysplasia mean gc content: 49.49
age:ch1 copd status:ch1 dysplasia status:ch1
Subject L1, dysplasia 69 No COPD Dysplasia
Subject L10, dysplasia 50 No COPD Dysplasia
Subject L11, dysplasia 49 No COPD Dysplasia
Subject L12, dysplasia 57 No COPD Dysplasia
Subject L13, dysplasia 64 No COPD Dysplasia
Subject L14, dysplasia 69 COPD Dysplasia
max histology:ch1 mean gc content:ch1 pack years:ch1
Subject L1, dysplasia Mild dysplasia 53.54 58
Subject L10, dysplasia Mild dysplasia 45.45 34
Subject L11, dysplasia Moderate dysplasia 44.44 35
Subject L12, dysplasia Mild dysplasia 45.45 40
Subject L13, dysplasia Mild dysplasia 47.47 37.58
Subject L14, dysplasia Moderate dysplasia 49.49 65
Sex:ch1 smoking status:ch1
Subject L1, dysplasia Male Former smoker
Subject L10, dysplasia Female Current smoker
Subject L11, dysplasia Male Current smoker
Subject L12, dysplasia Male Current smoker
Subject L13, dysplasia Female Former smoker
Subject L14, dysplasia Male Current smoker
Using this list, we identify 8 phenotypical (or environmental) variables of interest:
Age (characteristics_ch1.1)
Sex (characteristics_ch1.2)
Smoking status (characteristics_ch1.3): Current smoker or Former smoker
Pack years (characteristics_ch1.4): how many cigarettes a person has smoked in their lifetime
COPD Status (characteristics_ch1.1): Chronic obstructive pulmonary disease
Mean GC Content (characteristics_ch1.18)
Max Histology (characteristics_ch1.6): Histology types (tissue characteristics)
Dysplasia includes samples that have:
Mild Dysplasia
Moderate Dysplasia
Severe Dysplasia
No Dysplasia includes samples that have:
Hyperplasia
Normal histology
Samples that had metaplasia were excluded from the analysis
Dysplasia Status (characteristics_ch1.7)
Dysplasia
No Dysplasia
Metaplasia
To facilitate handling this variables we are going to perform the following modifications:
se$sex <- gsub("Sex: ", "", se$characteristics_ch1.2)
se$sex <- factor(se$sex, levels = c("Male", "Female"))
se$smoker <- gsub("smoking status: ", "", se$characteristics_ch1.3)
se$smoker <- factor(se$smoker, levels = c("Current smoker", "Former smoker"))
se$copd_status <- gsub("copd status: ", "", se$characteristics_ch1.5)
se$copd_status <- factor(se$copd_status, levels = c("COPD", "No COPD"))
se$histology <- gsub("max histology: ", "", se$characteristics_ch1.6)
se$histology <- factor(se$histology, levels(se$histology) <- c("Hyperplasia", "Metaplasia", "Mild dysplasia", "Moderate dysplasia", "Normal", "Severe dysplasia"))
se$dysplasia <- gsub("Subject L.*, ", "", se$title)
se$dysplasia <- factor(se$dysplasia)
To faciliate working with the pack years variable, we will split it into four categorical groups - Low, Medium, High and Very high.
To do that we will:
1. Calculate quartiles for the data.
2. Assign categories based on quartile ranges.
3. Convert the values into their corresponding categories.
se$characteristics_ch1.4 <- gsub("pack years: ", "", se$characteristics_ch1.4)
class(se$characteristics_ch1.4)
[1] "character"
se$characteristics_ch1.4 <- as.numeric(se$characteristics_ch1.4)
quartiles <- quantile(se$characteristics_ch1.4, probs = c(0.25, 0.5, 0.75))
Now that we have the quartiles, we can proceed to assign categories to the values based on these quartiles.
Values below the first quartile (Q1) will be categorized as “low.” Values between Q1 and the median (Q2) will be categorized as “medium.” Values between the median (Q2) and the third quartile (Q3) will be categorized as “high.” Values above the third quartile (Q3) will be categorized as “very high.”
se$pack_years <- cut(se$characteristics_ch1.4,
breaks = c(-Inf, quartiles[1], quartiles[2], quartiles[3], Inf),
labels = c("Low", "Medium", "High", "Very High"),
include.lowest = TRUE)
We will do the same with the age variable -split it into four intervals.
To do that we will:
1. Calculate quartiles for the data.
2. Assign categories based on quartile ranges.
3. Convert the values into their corresponding categories.
se$age <- gsub("age: ", "", se$characteristics_ch1.1)
class(se$age)
[1] "character"
se$age <- as.numeric(se$age)
quartiles <- quantile(se$age, probs = c(0.25, 0.5, 0.75))
quartiles
25% 50% 75%
58 64 69
Values below the first quartile (Q1) will be categorized as “low.” Values between Q1 and the median (Q2) will be categorized as “medium.” Values between the median (Q2) and the third quartile (Q3) will be categorized as “high.” Values above the third quartile (Q3) will be categorized as “very high.”
se$age_intervals <- cut(se$age,
breaks = c(-Inf, quartiles[1], quartiles[2], quartiles[3], Inf),
labels = c("age <= 58", "58 < age <= 64", "64 < age <= 69", "69 < age"),
include.lowest = TRUE)
In Table ?? below, we show the five variables of interest (smoking status, pack years, histology, dysplasia and COPD status) jointly to gather as much understanding as possible on the underlying experimental design.
In the following step we will evaluate the sequencing depth in terms of the total number of read counts that are mapped to each sample’s genome. The sequencing depth per sample, commonly referred to as library sizes, is displayed in Figure 1 below in increasing order.
Figure 1: Library sizes in increasing order
The plot displays a range of sequencing depths for the samples, each represented by an SRR identifier. The sequencing depth across our samples ranges from 20 to 50 million reads, with the majority falling between 30 to 40 million reads. This indicates a generally high level of coverage, which is beneficial for accurate genomic analysis. The plot also suggests that the highest number of reads corresponds to mild and moderate dysplasia, indicating early stages of cellular abnormalities that could potentially progress to cancer. The predominance of mild dysplasia in these samples could be of particular interest for early detection and intervention studies. There does not appear to be any evident bias in terms of sequencing depth across the conditions represented.
Figure 2 below shows the distribution of expression values per sample in logarithmic CPM units of expression.
Figure 2: Non-parametric density distribution of expression profiles per sample
There are no substantial differences between the samples in the distribution of expression values.
Let’s calculate now the average expression per gene through all the samples. Figure 3 shows the distribution of those values across genes.
Figure 3: Distribution of average expression level per gene
The histogram displays a bimodal distribution, which means there are two distinct peaks. The large peak to the left is the most prominent feature of the histogram, showing that most genes in this dataset are expressed at low levels or possibly not at all, under the conditions of the experiment. On the right part, the peak which indicates that as gene expression levels increase, fewer genes are found at these higher expression levels.
To filter out lowly-expressed genes, we first calculate the row means of the expression data and apply a logical mask to remove genes with an average logCPM ≤ 1. This step refines the dataset and removes noise, ensuring subsequent analysis is based on high-quality data.
Next, we calculate a CPM cutoff based on the smallest library size. We check for any missing values in the histology data to ensure data integrity.
We then focus on filtering genes based on their presence in samples with premalignant lesions (PML), specifically those categorized as “Mild dysplasia”, “Moderate dysplasia”, and “Severe dysplasia”. We count the number of PML samples and calculate the CPM for these samples only.
Finally, we apply a logical mask to retain genes that meet the CPM cutoff in at least the number of PML samples. This two-step filtering process is crucial for refining the dataset and focusing the analysis on genes with relevant expression patterns. Figure 4 shows the distribution of expression values across genes before and after filtering.
mask_initial <- rowMeans(assays(se)$logCPM) > 1
se_initial_filt <- se[mask_initial, ]
dge_initial_filt <- dge[mask_initial, ]
dim(se_initial_filt)
[1] 13374 81
cpmcutoff <- round(10/min(dge$samples$lib.size/1e6), digits=1)
cpmcutoff
[1] 0.4
if(any(is.na(se$histology))) {
stop("Missing values found in the histology column.")
}
pml_status <- se$histology %in% c("Mild dysplasia", "Moderate dysplasia", "Severe dysplasia")
nsamplescutoff <- sum(pml_status)
nsamplescutoff
[1] 49
pml_indices <- which(pml_status)
cpm_pml <- cpm(dge_initial_filt)[, pml_indices]
mask_final <- rowSums(cpm_pml > cpmcutoff) >= nsamplescutoff
sum(mask_final)
[1] 13167
se.filt <- se_initial_filt[mask_final, ]
dge.filt <- dge_initial_filt[mask_final, ]
dim(se.filt)
[1] 13167 81
Figure 4: Distribution of lowly-expressed genes
We are left with 13167 genes.
We calculate the normalization factors on the filtered expression data set.
dge.filt <- calcNormFactors(dge.filt)
Replace the raw log2 CPM units in the corresponding assay element of the
SummarizedExperiment object, by the normalized ones.
assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE,
normalized.lib.sizes=TRUE)
We examine now the MA-plots of the normalized expression profiles in the figures below.
Figure 5: MA-plots of filtered and normalized expression values
Figure 6: MA-plots of filtered and normalized expression values
Figure 7: MA-plots of filtered and normalized expression values
Figure 8: MA-plots of filtered and normalized expression values
Figure 9: MA-plots of filtered and normalized expression values
For genes above the horizontal line at M=0, expression is higher in the condition being tested against the reference, as for genes below, expression is lower. The red line represents a loose fit to the data, which should ideally be around M=0 if there is no differential expression between conditions. Any systematic deviation from M=0 indicates potential bias or systematic variation in the data. In most of the plots, the red line remains close to M=0 across the range of A values, indicating no systematic bias in expression changes relative to gene abundance. Some plots show a slight dip in the loose line at the lower end of the A scale, suggesting a potential bias for lowly expressed genes which could be potential candidates for differential expression. These points would be of interest for follow-up analyses.
To identify potential columns which could constitute unwanted variability, we looked at columns that contain information related to technical aspects, experimental conditions or processing details that could vary between different batches.
When doing EDA, we created a dataframe with all the phenotypical variables, and another one with the technical ones.
head(technical_se, n=3)
title geo_accession
Subject L1, dysplasia Subject L1, dysplasia GSM2088002
Subject L10, dysplasia Subject L10, dysplasia GSM2088003
Subject L11, dysplasia Subject L11, dysplasia GSM2088004
status submission_date last_update_date
Subject L1, dysplasia Public on May 25 2017 Mar 14 2016 May 15 2019
Subject L10, dysplasia Public on May 25 2017 Mar 14 2016 May 15 2019
Subject L11, dysplasia Public on May 25 2017 Mar 14 2016 May 15 2019
type channel_count source_name_ch1 organism_ch1
Subject L1, dysplasia SRA 1 Bronchial brushing Homo sapiens
Subject L10, dysplasia SRA 1 Bronchial brushing Homo sapiens
Subject L11, dysplasia SRA 1 Bronchial brushing Homo sapiens
characteristics_ch1 characteristics_ch1.8
Subject L1, dysplasia tissue: Bronchial brushing total alignments: 91000281
Subject L10, dysplasia tissue: Bronchial brushing total alignments: 96810722
Subject L11, dysplasia tissue: Bronchial brushing total alignments: 120861113
characteristics_ch1.9 characteristics_ch1.10
Subject L1, dysplasia unique alignments: 83490731 read 1 alignment: 42319429
Subject L10, dysplasia unique alignments: 88588756 read 1 alignment: 45257546
Subject L11, dysplasia unique alignments: 110290974 read 1 alignment: 56124363
characteristics_ch1.11
Subject L1, dysplasia read 2 alignment: 41171302
Subject L10, dysplasia read 2 alignment: 43331210
Subject L11, dysplasia read 2 alignment: 54166611
characteristics_ch1.12
Subject L1, dysplasia positive strand alignments: 41837114
Subject L10, dysplasia positive strand alignments: 44366147
Subject L11, dysplasia positive strand alignments: 55244102
characteristics_ch1.13
Subject L1, dysplasia negative strand alignments: 41653617
Subject L10, dysplasia negative strand alignments: 44222609
Subject L11, dysplasia negative strand alignments: 55046872
characteristics_ch1.14
Subject L1, dysplasia non-splice alignments: 62567474
Subject L10, dysplasia non-splice alignments: 64646471
Subject L11, dysplasia non-splice alignments: 79307301
characteristics_ch1.15
Subject L1, dysplasia splice alignments: 20923257
Subject L10, dysplasia splice alignments: 23942285
Subject L11, dysplasia splice alignments: 30983673
characteristics_ch1.16
Subject L1, dysplasia properly paired alignments: 66570222
Subject L10, dysplasia properly paired alignments: 70579486
Subject L11, dysplasia properly paired alignments: 87409558
characteristics_ch1.17 molecule_ch1
Subject L1, dysplasia genebody 80/20 ratio: 1.34213035203572 polyA RNA
Subject L10, dysplasia genebody 80/20 ratio: 1.25508564705863 polyA RNA
Subject L11, dysplasia genebody 80/20 ratio: 1.20885950984169 polyA RNA
extract_protocol_ch1
Subject L1, dysplasia Bronchial airway brushings were obtained during autofluorescence bronchoscopy. Autofluorescence bronchoscopy and endobronchial biopsy were used to identify and sample PMLs (if present). The pathological grade of the lesions with the worst histology present was recorded. Total RNA was extracted from bronchial brushings using miRNeasy Mini kit (Qiagen) according to manufacturer's protocol.
Subject L10, dysplasia Bronchial airway brushings were obtained during autofluorescence bronchoscopy. Autofluorescence bronchoscopy and endobronchial biopsy were used to identify and sample PMLs (if present). The pathological grade of the lesions with the worst histology present was recorded. Total RNA was extracted from bronchial brushings using miRNeasy Mini kit (Qiagen) according to manufacturer's protocol.
Subject L11, dysplasia Bronchial airway brushings were obtained during autofluorescence bronchoscopy. Autofluorescence bronchoscopy and endobronchial biopsy were used to identify and sample PMLs (if present). The pathological grade of the lesions with the worst histology present was recorded. Total RNA was extracted from bronchial brushings using miRNeasy Mini kit (Qiagen) according to manufacturer's protocol.
extract_protocol_ch1.1
Subject L1, dysplasia Sequencing libraries were prepared from total RNA samples using Illumina® TruSeq® RNA Sample Preparation Kit v2. The libraries from individual samples were pooled in groups of four for cluster generation on the Illumina® cBot® using Illumina® TruSeq® Paired-End Cluster Kit. Each sample was sequenced on the Illumina® HiSeq® 2500 to generate paired-end 100 nucleotide reads
Subject L10, dysplasia Sequencing libraries were prepared from total RNA samples using Illumina® TruSeq® RNA Sample Preparation Kit v2. The libraries from individual samples were pooled in groups of four for cluster generation on the Illumina® cBot® using Illumina® TruSeq® Paired-End Cluster Kit. Each sample was sequenced on the Illumina® HiSeq® 2500 to generate paired-end 100 nucleotide reads
Subject L11, dysplasia Sequencing libraries were prepared from total RNA samples using Illumina® TruSeq® RNA Sample Preparation Kit v2. The libraries from individual samples were pooled in groups of four for cluster generation on the Illumina® cBot® using Illumina® TruSeq® Paired-End Cluster Kit. Each sample was sequenced on the Illumina® HiSeq® 2500 to generate paired-end 100 nucleotide reads
taxid_ch1 description
Subject L1, dysplasia 9606 Sample_L1
Subject L10, dysplasia 9606 Sample_L10
Subject L11, dysplasia 9606 Sample_L11
data_processing
Subject L1, dysplasia Demultiplexing and creation of FASTQ files was performed using Illumina CASAVA v1.8.2 software.
Subject L10, dysplasia Demultiplexing and creation of FASTQ files was performed using Illumina CASAVA v1.8.2 software.
Subject L11, dysplasia Demultiplexing and creation of FASTQ files was performed using Illumina CASAVA v1.8.2 software.
data_processing.1
Subject L1, dysplasia The FASTQ files were aligned to hg19 using TopHat v2.0.4 software and default parameters in addition to the mean insert size and standard deviation determined using MISO based on an initial TopHat alignment.
Subject L10, dysplasia The FASTQ files were aligned to hg19 using TopHat v2.0.4 software and default parameters in addition to the mean insert size and standard deviation determined using MISO based on an initial TopHat alignment.
Subject L11, dysplasia The FASTQ files were aligned to hg19 using TopHat v2.0.4 software and default parameters in addition to the mean insert size and standard deviation determined using MISO based on an initial TopHat alignment.
data_processing.2
Subject L1, dysplasia Alignment and quality metrics, including genebody ratio, were calculated using RSeQC v2.3.3.
Subject L10, dysplasia Alignment and quality metrics, including genebody ratio, were calculated using RSeQC v2.3.3.
Subject L11, dysplasia Alignment and quality metrics, including genebody ratio, were calculated using RSeQC v2.3.3.
data_processing.3
Subject L1, dysplasia Gene count estimates were derived using HTSeq-count v0.5.4 with Python v2.7.3 and the Ensembl v64 GTF annotation file.
Subject L10, dysplasia Gene count estimates were derived using HTSeq-count v0.5.4 with Python v2.7.3 and the Ensembl v64 GTF annotation file.
Subject L11, dysplasia Gene count estimates were derived using HTSeq-count v0.5.4 with Python v2.7.3 and the Ensembl v64 GTF annotation file.
data_processing.4
Subject L1, dysplasia Gene filtering was conducted on normalized counts per million (cpm) calculated using R v3.0.0 and edgeR v3.4.2 using a modified version of the mixture model in the SCAN.UPC Bioconductor package.
Subject L10, dysplasia Gene filtering was conducted on normalized counts per million (cpm) calculated using R v3.0.0 and edgeR v3.4.2 using a modified version of the mixture model in the SCAN.UPC Bioconductor package.
Subject L11, dysplasia Gene filtering was conducted on normalized counts per million (cpm) calculated using R v3.0.0 and edgeR v3.4.2 using a modified version of the mixture model in the SCAN.UPC Bioconductor package.
data_processing.5
Subject L1, dysplasia Genome_build: hg19
Subject L10, dysplasia Genome_build: hg19
Subject L11, dysplasia Genome_build: hg19
data_processing.6
Subject L1, dysplasia Supplementary_files_format_and_content: tab-delimited matrices includes raw count values for each Sample and each filtered Gene, CPM values for each Sample and each filtered Gene, RPKM values for each Sample and each filtered Gene, raw count values for each Sample and each filtered Gene, and RPKM values for each Sample and each filtered Gene
Subject L10, dysplasia Supplementary_files_format_and_content: tab-delimited matrices includes raw count values for each Sample and each filtered Gene, CPM values for each Sample and each filtered Gene, RPKM values for each Sample and each filtered Gene, raw count values for each Sample and each filtered Gene, and RPKM values for each Sample and each filtered Gene
Subject L11, dysplasia Supplementary_files_format_and_content: tab-delimited matrices includes raw count values for each Sample and each filtered Gene, CPM values for each Sample and each filtered Gene, RPKM values for each Sample and each filtered Gene, raw count values for each Sample and each filtered Gene, and RPKM values for each Sample and each filtered Gene
platform_id data_row_count instrument_model
Subject L1, dysplasia GPL16791 0 Illumina HiSeq 2500
Subject L10, dysplasia GPL16791 0 Illumina HiSeq 2500
Subject L11, dysplasia GPL16791 0 Illumina HiSeq 2500
library_selection library_source library_strategy
Subject L1, dysplasia cDNA transcriptomic RNA-Seq
Subject L10, dysplasia cDNA transcriptomic RNA-Seq
Subject L11, dysplasia cDNA transcriptomic RNA-Seq
relation
Subject L1, dysplasia SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX1631440
Subject L10, dysplasia SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX1631441
Subject L11, dysplasia SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX1631442
relation.1
Subject L1, dysplasia BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN04557034
Subject L10, dysplasia BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN04557035
Subject L11, dysplasia BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN04556976
supplementary_file_1 genebody 80/20 ratio:ch1
Subject L1, dysplasia NONE 1.34213035203572
Subject L10, dysplasia NONE 1.25508564705863
Subject L11, dysplasia NONE 1.20885950984169
negative strand alignments:ch1 non-splice alignments:ch1
Subject L1, dysplasia 41653617 62567474
Subject L10, dysplasia 44222609 64646471
Subject L11, dysplasia 55046872 79307301
positive strand alignments:ch1
Subject L1, dysplasia 41837114
Subject L10, dysplasia 44366147
Subject L11, dysplasia 55244102
properly paired alignments:ch1 read 1 alignment:ch1
Subject L1, dysplasia 66570222 42319429
Subject L10, dysplasia 70579486 45257546
Subject L11, dysplasia 87409558 56124363
read 2 alignment:ch1 splice alignments:ch1
Subject L1, dysplasia 41171302 20923257
Subject L10, dysplasia 43331210 23942285
Subject L11, dysplasia 54166611 30983673
tissue:ch1 total alignments:ch1
Subject L1, dysplasia Bronchial brushing 91000281
Subject L10, dysplasia Bronchial brushing 96810722
Subject L11, dysplasia Bronchial brushing 120861113
unique alignments:ch1
Subject L1, dysplasia 83490731
Subject L10, dysplasia 88588756
Subject L11, dysplasia 110290974
Here are some columns which we thought could introduce unwanted
variability: submission_date, last_update_date, type,
channel_count, extract_protocol_ch1, extract_protocol_ch1.1,
data_processing, data_processing.1, data_processing.2,
data_processing.3, data_processing.4, data_processing.5,
data_processing.6, platform_id, instrument_model,
library_selection, library_source, library_strategy.
These variables have the same values across all samples which suggests that they are not contributing to the variability of the dataset. For example:
table(se.filt$histology, se.filt$library_strategy)
RNA-Seq
Hyperplasia 13
Metaplasia 7
Mild dysplasia 35
Moderate dysplasia 11
Normal 12
Severe dysplasia 3
There is perfect correlation between library strategy and histology. This means that differences between our samples will be due to biological differences rather than technical.
We can discard all of the variables that have the same value across all samples:
deleted <- c()
for(column in colnames(technical_se)){
if(length(levels(technical_se[[column]]))==1){
technical_se[[column]] <- NULL
deleted <- c(deleted, c(column))
}
}
head(deleted)
[1] "status" "submission_date" "last_update_date" "type"
[5] "channel_count" "source_name_ch1"
The rest of the variables are related to technical aspects related to the sequencing that’s specific to every sample, therefore we can discard them as sources of unwanted variability.
We have to now take a look at how our phenotypical variables interact and whether they might explain any sample clustering we may find, to help us make a decision on what groups to choose for our DE analysis.
First, let’s examine the distribution between dysplasia and histology.
table(se.filt$dysplasia, se.filt$histology)
Hyperplasia Metaplasia Mild dysplasia Moderate dysplasia Normal
dysplasia 0 0 35 11 0
metaplasia 0 7 0 0 0
no dysplasia 13 0 0 0 12
Severe dysplasia
dysplasia 3
metaplasia 0
no dysplasia 0
This is a very obvious result, since we know dysplasia was drawn from the histology variable. This is a great example of combinations that aren’t sequenced (in this case, because they can’t exist.)
We can also examine the distribution between histology and our other prototypical variables
table(se.filt$histology, se.filt$age_intervals)
age <= 58 58 < age <= 64 64 < age <= 69 69 < age
Hyperplasia 3 4 3 3
Metaplasia 1 4 1 1
Mild dysplasia 15 5 8 7
Moderate dysplasia 3 2 3 3
Normal 2 3 3 4
Severe dysplasia 0 1 1 1
table(se.filt$histology, se.filt$sex)
Male Female
Hyperplasia 9 4
Metaplasia 3 4
Mild dysplasia 22 13
Moderate dysplasia 9 2
Normal 7 5
Severe dysplasia 3 0
table(se.filt$histology, se.filt$pack_years)
Low Medium High Very High
Hyperplasia 5 3 1 4
Metaplasia 2 1 2 2
Mild dysplasia 12 7 9 7
Moderate dysplasia 1 5 2 3
Normal 1 4 4 3
Severe dysplasia 0 0 2 1
table(se.filt$histology, se.filt$copd_status)
COPD No COPD
Hyperplasia 4 9
Metaplasia 2 5
Mild dysplasia 10 25
Moderate dysplasia 5 6
Normal 1 11
Severe dysplasia 2 1
We can see that not all combinations of pack-years and histology have been sequenced. Some combinations have missing values, indicated by 0s.
It’s interesting to note that the 0s indicate:
Groups like low or medium pack years (people that have smoked less cigarettes over the course of their lives) have no samples with severe dysplasia, meaning severe detriment to their bronchial tissue.
There’s also no females with severe dysplasia.
And the severe dysplasia group also lacks samples on the younger age interval.
The distribution we’re interested in, however, is the one between these phenotypical variables and the dysplasia variable (same concept as histology, but grouped in three groups)
table(se.filt$dysplasia, se.filt$age_intervals)
age <= 58 58 < age <= 64 64 < age <= 69 69 < age
dysplasia 18 8 12 11
metaplasia 1 4 1 1
no dysplasia 5 7 6 7
table(se.filt$dysplasia, se.filt$sex)
Male Female
dysplasia 34 15
metaplasia 3 4
no dysplasia 16 9
table(se.filt$dysplasia, se.filt$pack_years)
Low Medium High Very High
dysplasia 13 12 13 11
metaplasia 2 1 2 2
no dysplasia 6 7 5 7
table(se.filt$dysplasia, se.filt$copd_status)
COPD No COPD
dysplasia 17 32
metaplasia 2 5
no dysplasia 5 20
We can see that we have sequences for all of these different combinations.
We examine now how samples group together by hierarchical clustering and multidimensional scaling, annotating pack-years and histology. We calculate again log CPM values with a high prior count (3) to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure 10.
Figure 10: Figure S6: Hierarchical clustering of the samples
Labels correspond to pack-years and sample identifer, while colors indicate histology groups.
Samples are linked together based on their similarity, with the branches connecting those most similar first. As we move up the tree, the connections represent progressively less similar groups until all are joined in a single cluster. Samples do not really cluster together by histology type, which means histology type is not the primary factor driving the similarity or dissimilarity between samples.
If we look more closely at the labels, we can also see even if we look at clustering through the 3 dysplasia groups instead of all 6 histology groups, there is still no clear clustering.
We can now try to identify a phenotypical variable that will justify clustering.
Figure 11: Figure S6: Hierarchical clustering of the samples
Labels correspond to dysplasia status, while colors indicate age groups.
Figure 12: Figure S6: Hierarchical clustering of the samples
Labels correspond to dysplasia status, while colors indicate sex groups.
Figure 13: Figure S6: Hierarchical clustering of the samples
Labels correspond to dysplasia status, while colors indicate COPD groups.
Figure 14: Figure S6: Hierarchical clustering of the samples
Labels correspond to dysplasia status, while colors indicate smoking status groups.
Figure 15: Figure S6: Hierarchical clustering of the samples
Labels correspond to dysplasia status, while colors indicate pack-years groups.
No clustering is evident in any of the resulting plots except for Smoking Status, where we can see some clustering between the two categories: Current smoker vs. Former smoker.
In Figure 16 we show the corresponding MDS plot for the Smoking Status variable.
Figure 16: Figure S7: Multidimensional scaling plot of the samples
Labels correspond to treatment and colors indicate sample group.
We can distinguish more clearly that there are indeed two clusters, one for Current smokers and another one for Former smokers. This leads us to assume these two groups have significant differences in their gene expression.
Our Exploratory Data Analysis and new Experimental Design allowed us to identify a variable that raises interesting questions when faced with the prospect of investigating it: Smoking status.
All of the samples in our dataset have either been or currently are smokers. Based on the results we’ve obtained on this first part of our project, we’ve decided to use this variable to design our new groups. We obtain: Current Smokers (n = r length(se.filt\(smoker[se.filt\)smoker == “Current smoker”])) and Former smokers (n = r length(se.filt\(smoker[se.filt\)smoker == “Former smoker”])).
We’re also interested in looking at another variable, seeing if it has statistically significant results, to be able to compare its findings to the Smoking status analysis. Based on our clustering, we have identified Dysplasia as a variable that clusters. Dysplasia, which is based on Histology level, assessing the damage in the tissue. It is relevan to see how Current vs. Former smokers differ and also which processes differentially involved in one of these groups is also related to airway tissue damage.
Our goal is to find differentially expressed gene sets and pathways between these groups. To do so, we’ll perform Differential Expression Analysis with Linear Regression models, using Bioconductor’s limma pipeline Smyth (2004) Matthew E. Ritchie (2015). The results will be used to perform Functional Analysis using the Bioconductor package GOstats (Gene Ontology), and finally we’ll perform Gene Set Enrichment Analysis to obtain our final results.
These results will help us gain an insight into How the genome changes once a person stops smoking, and wether or not pathways invoved in this are also involved in damage to airway tissue.
From hereon, every step will be done once for the Dysplasia grouping (samples with PML vs without) and then for the Smoking status grouping, to obtain our new results.
The next step is to do a Differential Expression Analysis. At first we’ll simply use fold-change values to get a first glance at differentially expressed genes, and we’ll follow that by DE through linear regression. We chose to do two separate DE Analysis, based on the following variables:
PML (Premalignant Lesions)
Smoking status
Firstly, a simple way to find differentially expressed genes is to simply compare expression levels between the two groups of samples.
We can do this using fold-change, which compares the mean read counts (using the logarithmic scale to try and stabilize variability across expression levels).
We’ll see how many genes change their expression between the two groups using two different classifications.
Our two groups of samples are: samples of Current Smokers and samples of Former Smokers. We calculate the means of the values of the expression levels.
current <- rowMeans(logCPM[, se.filt$smoker=="Current smoker"])
former <- rowMeans(logCPM[, se.filt$smoker=="Former smoker"])
We can now calculate fold-change values for each gene in these two groups. We’ll rank the genes by decreasing absolute value of Fold Change, and we’ll call differentially expressed genes those that are at the top 10.
log2fc_smoker <- current - former
ranking <- order(abs(log2fc_smoker), decreasing=TRUE)
fold_change_smoker <- data.frame(EntrezID=names(log2fc_smoker[ranking]),
Log2FC=round(log2fc_smoker[ranking], digits=3),
FC=round(2^log2fc_smoker[ranking], digits=3),
"1/FC"=round(2^(-log2fc_smoker[ranking]), digits=3),
row.names=rowData(se.filt)$symbol[ranking],
check.names=FALSE)
de_fc_smoker <- head(fold_change_smoker, n=10)
de_fc_smoker
EntrezID Log2FC FC 1/FC
AKR1B10 57016 4.732 26.584 0.038
UCHL1 7345 4.439 21.688 0.046
CYP1B1 1545 4.030 16.332 0.061
BPIFB2 80341 3.842 14.343 0.070
SPP1 6696 3.496 11.286 0.089
SLC7A11 23657 3.362 10.281 0.097
ALDH3A1 218 3.285 9.749 0.103
CABYR 26256 2.991 7.950 0.126
AKR1C2 1646 2.942 7.686 0.130
H19 283120 2.882 7.372 0.136
In the resulting Data Frame:
If a gene has a positive Log2FC value, it means said gene is FC times more upregulated in samples that are Current smokers compared to in samples that are Former smokers.
If a gene has a negative Log2FC value, it means said gene is 1/FC times more upregulated in samples that are Former smokers compared to in samples that are Current smokers.
We can clearly see the difference of expression between groups is larger with the smoking status distinction than the dysplasia one.
The two main groups we want to contrast are samples with dysplasia and samples without it. That is why our outcome target is Dysplasia.
no_dysplasia <- rowMeans(logCPM[, se.filt$dysplasia=="no dysplasia"])
dysplasia <- rowMeans(logCPM[, se.filt$dysplasia=="dysplasia"])
We can now calculate fold-change values for each gene in these two groups.
log2fc_dysplasia <- dysplasia - no_dysplasia
ranking <- order(abs(log2fc_dysplasia), decreasing=TRUE)
fold_change_dysplasia <- data.frame(EntrezID=names(log2fc_dysplasia[ranking]),
Log2FC=round(log2fc_dysplasia[ranking], digits=3),
FC=round(2^log2fc_dysplasia[ranking], digits=3),
"1/FC"=round(2^(-log2fc_dysplasia[ranking]), digits=3),
row.names=rowData(se.filt)$symbol[ranking],
check.names=FALSE)
de_fc_dysplasia <- head(fold_change_dysplasia, n=10)
de_fc_dysplasia
EntrezID Log2FC FC 1/FC
CLDN18 51208 -1.040 0.486 2.056
MROH7 374977 -1.031 0.490 2.043
AGAP12P 414224 -0.975 0.509 1.965
FAM95C 100289137 -0.910 0.532 1.880
FAM238C 387644 -0.902 0.535 1.869
AC090826.1 729739 -0.896 0.537 1.861
LRP2 4036 -0.891 0.539 1.855
CAPN3 825 -0.885 0.541 1.847
AOC3 8639 -0.875 0.545 1.834
NPIPB3 23117 -0.848 0.556 1.800
In the resulting Data Frame:
If a gene has a positive Log2FC value, it means said gene is FC times more upregulated in samples with Dysplasia compared to in samples without Dysplasia.
If a gene has a negative Log2FC value, it means said gene is 1/FC times more upregulated in patients without Dysplasia compared to in patients with Dysplasia
We will now use Linear Regression Models to perform our Differential Expression, using the limma pipeline.
library(limma)
We’ll use the same two groupings, and use the followings steps:
We will start with smoking status since the fold-change values are higher and it probably means we’ll need less adjusting during the process (and it’ll be simpler).
First, we build our design matrix.
mod <- model.matrix(~ smoker, data=colData(se.filt))
head(mod)
(Intercept) smokerFormer smoker
SRR3225255 1 1
SRR3225256 1 0
SRR3225257 1 0
SRR3225258 1 0
SRR3225259 1 1
SRR3225260 1 0
Next, we fit the linear regression model
fit <- lmFit(assays(se.filt)$logCPM, mod)
We calculate moderated t-statistics
fit <- eBayes(fit)
A quick overview of the results for FDR p-value cutoff at 5% (default)
res <- decideTests(fit)
summary(res)
(Intercept) smokerFormer smoker
Down 0 3394
NotSig 0 6125
Up 13167 3648
summary(res)[1,2] + summary(res)[3,2]
[1] 7042
We have 3394 downregulated genes, and 3648 upregulated genes, a total of 7042.
Seeing these results, we can afford to be a little more restricting with the FDR cut-off. We’ll fetch the results with a p-value cut-off of 0,001%.
res <- decideTests(fit, p.value=0.00001)
summary(res)
(Intercept) smokerFormer smoker
Down 0 805
NotSig 7 11806
Up 13160 556
summary(res)[1,2] + summary(res)[3,2]
[1] 1361
We now have 805 downregulated genes, and 556 upregulated genes, a total of 1361.
And a more detailed view of the results (we’ll be sure to add more information about the genes, so as to not only have the EntrezID identifier as rowname).
The entire table, with information about all the genes, sorted by p-value:
genesmd <- data.frame(chr=as.character(seqnames(rowRanges(se.filt))),
symbol=rowData(se.filt)[,5],
stringsAsFactors=FALSE)
gene_id=rowData(se.filt)[,1]
fit$genes <- genesmd
tt_smoker <- topTable(fit, coef=2, n=Inf)
tt_smoker <- tt_smoker[order(tt_smoker$adj.P.Val),]
dim(tt_smoker)
[1] 13167 8
head(tt_smoker)
chr symbol logFC AveExpr t P.Value adj.P.Val
218 17 ALDH3A1 -3.285348 9.983528 -11.59737 6.093143e-19 8.022842e-15
10553 11 HTATIP2 -1.015938 5.636573 -11.33322 1.960761e-18 1.037547e-14
4199 6 ME1 -2.063224 4.596722 -11.29109 2.363971e-18 1.037547e-14
26256 18 CABYR -3.002625 3.221512 -11.14008 4.627564e-18 1.315031e-14
1728 16 NQO1 -1.812183 9.108047 -11.12299 4.993663e-18 1.315031e-14
7345 4 UCHL1 -4.451732 4.192392 -10.92192 1.225474e-17 2.095782e-14
B
218 32.59231
10553 31.45628
4199 31.27444
26256 30.62121
1728 30.54715
7345 29.67372
But our goal is to get a list of DE genes, so, finally, we’ll fetch the genes called DE with FDR <0,1%
DEgenes_smoker <- rownames(tt_smoker)[tt_smoker$adj.P.Val < 0.00001]
length(DEgenes_smoker)
[1] 1361
We’ll now plot the distribution of p-values and moderated t-statistics (Figure 17)
par(mfrow=c(1,2), mar=c(4, 5, 2, 2))
hist(tt_smoker$P.Value, xlab="Raw P-values", main="", las=1)
qqt(fit$t[, 2], df=fit$df.prior+fit$df.residual, main="", pch=".", cex=3)
abline(0, 1, lwd=2)
Figure 17: Distribution of p-values and moderated t-statistics on every gene between samples with that are Current smokers vs. samples that are former smokers
Another relevant one is the volcano plot (Figure ??).
volcanoplot(fit, coef=2, highlight=7, names=fit$genes$symbol, las=1)
Figure 18: P-value diagnostic plot for differental expression between samples that are Current smokers and samples that are Former smokers
We can see from the plots our results are highly relevant and statistically significant.
Before we start, we have to exclude the metaplasia group, seeing as it was excluded in the paper and we need two groups only, dysplasia vs. no dysplasia.
se.filt <- se.filt[,se.filt$dysplasia!="metaplasia"]
se.filt$dysplasia <- factor(se.filt$dysplasia, levels = c("dysplasia", "no dysplasia"))
dim(se.filt)
[1] 13167 74
First, we build our design matrix.
mod <- model.matrix(~ dysplasia, data=colData(se.filt))
head(mod)
(Intercept) dysplasiano dysplasia
SRR3225255 1 0
SRR3225256 1 0
SRR3225257 1 0
SRR3225258 1 0
SRR3225259 1 0
SRR3225260 1 0
Next, we fit the linear regression model
fit <- lmFit(assays(se.filt)$logCPM, mod)
We calculate moderated t-statistics
fit <- eBayes(fit)
A quick overview of the results for FDR p-value cutoff at 5% (default)
res <- decideTests(fit)
summary(res)
(Intercept) dysplasiano dysplasia
Down 0 1499
NotSig 0 9950
Up 13167 1718
summary(res)[1,2] + summary(res)[3,2]
[1] 3217
We have 1499 downregulated genes, and 1718 upregulated genes, a total of 3217.
Seeing these results, we can afford to be a little more restricting with the FDR cut-off. We’ll fetch the results with a p-value cut-off of 1%.
res <- decideTests(fit, p.value=0.01)
summary(res)
(Intercept) dysplasiano dysplasia
Down 0 548
NotSig 0 11850
Up 13167 769
summary(res)[1,2] + summary(res)[3,2]
[1] 1317
We now have 548 downregulated genes, and 769 upregulated genes, a total of 1317.
And a more detailed view of the results (we’ll be sure to add more information about the genes, so as to not only have the EntrezID identifier as rowname).
The entire table, with information about all the genes, sorted by p-value:
genesmd <- data.frame(chr=as.character(seqnames(rowRanges(se.filt))),
symbol=rowData(se.filt)[,5],
stringsAsFactors=FALSE)
gene_id=rowData(se.filt)[,1]
fit$genes <- genesmd
tt_dysplasia <- topTable(fit, coef=2, n=Inf)
tt_dysplasia <- tt_dysplasia[order(tt_dysplasia$adj.P.Val),]
dim(tt_dysplasia)
[1] 13167 8
head(tt_dysplasia)
chr symbol logFC AveExpr t P.Value
26505 2 CNNM3 0.3315382 5.375215 5.865796 1.147186e-07
101927550 7 AC004893.2 0.4896133 1.522941 5.805219 1.471928e-07
23334 1 SZT2 0.4180050 5.864061 5.710929 2.165960e-07
7580 10 ZNF32 -0.2577731 3.601835 -5.622424 3.106504e-07
825 15 CAPN3 0.8951923 1.604204 5.570312 3.837861e-07
29123 16 ANKRD11 0.2243954 6.929926 5.541201 4.317646e-07
adj.P.Val B
26505 0.0008456886 7.407644
101927550 0.0008456886 7.177840
23334 0.0008456886 6.821753
7580 0.0008456886 6.489386
825 0.0008456886 6.294578
29123 0.0008456886 6.186050
But our goal is to get a list of DE genes, so, finally, we’ll fetch the genes called DE with FDR <1%
DEgenes_dysplasia <- rownames(tt_dysplasia)[tt_dysplasia$adj.P.Val < 0.01]
length(DEgenes_dysplasia)
[1] 1317
We’ll now plot the distribution of p-values and moderated t-statistics (Figure 19)
par(mfrow=c(1,2), mar=c(4, 5, 2, 2))
hist(tt_dysplasia$P.Value, xlab="Raw P-values", main="", las=1)
qqt(fit$t[, 2], df=fit$df.prior+fit$df.residual, main="", pch=".", cex=3)
abline(0, 1, lwd=2)
Figure 19: Distribution of p-values and moderated t-statistics on every gene between samples with PML vs. without PML
Another relevant one is the volcano plot (Figure ??).
volcanoplot(fit, coef=2, highlight=7, names=fit$genes$symbol, las=1)
Figure 20: P-value diagnostic plot for differental expression between samples with Dysplasia and samples without Dysplasia
We can see from the plots our results are highly relevant and statistically significant.
Now, let’s attempt to find DE pathways, as DE gene sets, by detecting the functional enrichment.
Firstly, we will attempt to do it with the Gene Ontology (GO) analysis which corresponds to applying the one-tailed Fisher’s exact test to every gene set in the GO database. To execute this, we will use the Bioconductor package GOstats.
The first step is to create our gene universe.
library(GOstats)
library(org.Hs.eg.db)
geneUniverse <- rownames(se.filt)
Let’s build the parameter object and use the conditional test which computes the significance of a GO term conditional on the significance of its children.
paramsSmokerStatus <- new("GOHyperGParams", geneIds=DEgenes_smoker,
universeGeneIds=geneUniverse,
annotation="org.Hs.eg.db", ontology="BP",
pvalueCutoff=0.05, testDirection="over")
conditional(paramsSmokerStatus) <- TRUE
The next step is to run the functional enrichment analysis.
hgOverCondSmokerStatus <- hyperGTest(paramsSmokerStatus)
hgOverCondSmokerStatus
Gene to GO BP Conditional test for over-representation
7785 GO BP ids tested (339 have p < 0.05)
Selected gene set size: 1233
Gene universe size: 11592
Annotation package: org.Hs.eg
And lastly to store and visualize the results. The resHgOverCond data.frame object contacts the results of the hypothesis test for each GO term, satisfying the P-value cutoff.
resHgOverCondSmokerStatus <- summary(hgOverCondSmokerStatus)
dim(resHgOverCondSmokerStatus)
[1] 339 7
head(resHgOverCondSmokerStatus)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0006890 1.236820e-06 4.935723 5.211957 18 49
2 GO:0006888 1.984599e-05 2.743511 12.232143 28 115
3 GO:0045454 6.446669e-05 4.588627 3.935559 13 37
4 GO:0097237 1.169632e-04 2.715806 10.104814 23 95
5 GO:0043436 2.293796e-04 1.525248 69.457298 98 653
6 GO:0042592 2.386685e-04 1.395609 121.789596 158 1145
Term
1 retrograde vesicle-mediated transport, Golgi to endoplasmic reticulum
2 endoplasmic reticulum to Golgi vesicle-mediated transport
3 cell redox homeostasis
4 cellular response to toxic substance
5 oxoacid metabolic process
6 homeostatic process
ifelse(!dir.exists(file.path("./results")),
dir.create(file.path("./results")),
"Directory Exists")
[1] "Directory Exists"
htmlReport(hgOverCondSmokerStatus, file="results/gotests_smoker_status.html")
browseURL("results/gotests_smoker_status.html")
Check out the result here.
We do not want our GO terms to be involving too few genes or too many genes. That would make them too unreliable. So let’s filter the previous results with a minimum size of 3, a maximum size of 300 and a minimum count of 3. Then we order them by the OddsRatio column.
resHgOverCondSmokerStatus <- resHgOverCondSmokerStatus[!is.infinite(resHgOverCondSmokerStatus$OddsRatio), ]
mask <- resHgOverCondSmokerStatus$Size >= 3 & resHgOverCondSmokerStatus$Size <= 300 & resHgOverCondSmokerStatus$Count >= 3
resHgOverCondSmokerStatus <- resHgOverCondSmokerStatus[mask, ]
dim(resHgOverCondSmokerStatus)
[1] 249 7
ord <- order(resHgOverCondSmokerStatus$OddsRatio, decreasing=TRUE)
resHgOverCondSmokerStatus <- resHgOverCondSmokerStatus[ord, ]
head(resHgOverCondSmokerStatus)
GOBPID Pvalue OddsRatio ExpCount Count Size
13 GO:0015827 0.0005831661 33.71196 0.5318323 4 5
40 GO:0015829 0.0044208430 25.26341 0.4254658 3 4
41 GO:0018197 0.0044208430 25.26341 0.4254658 3 4
24 GO:0016115 0.0016016227 16.85435 0.6381988 4 6
11 GO:0044597 0.0005740280 14.05537 0.8509317 5 8
12 GO:0044598 0.0005740280 14.05537 0.8509317 5 8
Term
13 tryptophan transport
40 valine transport
41 peptidyl-aspartic acid modification
24 terpenoid catabolic process
11 daunorubicin metabolic process
12 doxorubicin metabolic process
Using the geneIdsByCategory() method, we extract the genes that enrich each GO term and paste it to the result.
geneIDs <- geneIdsByCategory(hgOverCondSmokerStatus)[resHgOverCondSmokerStatus$GOBPID]
geneSYMs <- sapply(geneIDs, function(id) select(org.Hs.eg.db, columns="SYMBOL", key=id, keytype="ENTREZID")$SYMBOL)
geneSYMs <- sapply(geneSYMs, paste, collapse=", ")
goresultsSmokerStatus <- cbind(resHgOverCondSmokerStatus, Genes=geneSYMs)
rownames(goresultsSmokerStatus) <- 1:nrow(goresultsSmokerStatus)
We generate an HTML page using the knitr and kableExtra packages.
See the HTML page by clicking on this link.
Table 1 displays the top 10 Go results (sorted by Odds Ratio).
| Pvalue | OddsRatio | ExpCount | Count | Size | Term | |
|---|---|---|---|---|---|---|
| 1 | 0.0005832 | 33.71196 | 0.5318323 | 4 | 5 | tryptophan transport |
| 2 | 0.0044208 | 25.26341 | 0.4254658 | 3 | 4 | valine transport |
| 3 | 0.0044208 | 25.26341 | 0.4254658 | 3 | 4 | peptidyl-aspartic acid modification |
| 4 | 0.0016016 | 16.85435 | 0.6381988 | 4 | 6 | terpenoid catabolic process |
| 5 | 0.0005740 | 14.05537 | 0.8509317 | 5 | 8 | daunorubicin metabolic process |
| 6 | 0.0005740 | 14.05537 | 0.8509317 | 5 | 8 | doxorubicin metabolic process |
| 7 | 0.0101774 | 12.63049 | 0.5318323 | 3 | 5 | sinoatrial node development |
| 8 | 0.0101774 | 12.63049 | 0.5318323 | 3 | 5 | taurine transport |
| 9 | 0.0101774 | 12.63049 | 0.5318323 | 3 | 5 | isoleucine transport |
| 10 | 0.0101774 | 12.63049 | 0.5318323 | 3 | 5 | thyroid hormone transport |
Build the parameter object and use the conditional test.
paramsDysplasia <- new("GOHyperGParams", geneIds=DEgenes_dysplasia,
universeGeneIds=geneUniverse,
annotation="org.Hs.eg.db", ontology="BP",
pvalueCutoff=0.05, testDirection="over")
conditional(paramsDysplasia) <- TRUE
Run the functional enrichment analysis.
hgOverCondDysplasia <- hyperGTest(paramsDysplasia)
hgOverCondDysplasia
Gene to GO BP Conditional test for over-representation
6454 GO BP ids tested (157 have p < 0.05)
Selected gene set size: 1136
Gene universe size: 11592
Annotation package: org.Hs.eg
Store and visualize the results.
resHgOverCondDysplasia <- summary(hgOverCondDysplasia)
dim(resHgOverCondDysplasia)
[1] 157 7
head(resHgOverCondDysplasia)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0002181 8.862844e-50 21.497819 10.52792 74 108
2 GO:0043604 2.182958e-29 3.195909 68.79503 167 702
3 GO:0006518 5.111718e-27 3.018487 71.24500 166 727
4 GO:0034641 2.890271e-18 1.748412 419.50341 553 4328
5 GO:1901566 3.312695e-16 2.076009 115.41480 200 1195
6 GO:0022904 6.184984e-13 5.134895 10.58385 38 108
Term
1 cytoplasmic translation
2 amide biosynthetic process
3 peptide metabolic process
4 cellular nitrogen compound metabolic process
5 organonitrogen compound biosynthetic process
6 respiratory electron transport chain
htmlReport(hgOverCondDysplasia, file="./results/gotests_dysplasia.html")
browseURL("results/gotests_dysplasia.html")
Check out the result here.
Filter the previous results with a minimum size of 3, a maximum size of 300 and a minimum count of 3 and order them by the OddsRatio column.
resHgOverCondDysplasia <- resHgOverCondDysplasia[!is.infinite(resHgOverCondDysplasia$OddsRatio), ]
mask <- resHgOverCondDysplasia$Size >= 3 & resHgOverCondDysplasia$Size <= 300 & resHgOverCondDysplasia$Count >= 3
resHgOverCondDysplasia <- resHgOverCondDysplasia[mask, ]
dim(resHgOverCondDysplasia)
[1] 107 7
ord <- order(resHgOverCondDysplasia$OddsRatio, decreasing=TRUE)
resHgOverCondDysplasia <- resHgOverCondDysplasia[ord, ]
head(resHgOverCondDysplasia)
GOBPID Pvalue OddsRatio ExpCount Count Size
30 GO:0036261 2.058449e-05 27.75398 0.7839890 6 8
61 GO:2000435 3.480267e-03 27.68314 0.3919945 3 4
1 GO:0002181 8.862844e-50 21.49782 10.5279197 74 108
13 GO:0006123 5.424236e-08 17.02963 1.6659765 11 17
73 GO:2000627 8.066023e-03 13.84025 0.4899931 3 5
57 GO:0015986 2.357034e-03 12.55696 0.6739526 4 7
Term
30 7-methylguanosine cap hypermethylation
61 negative regulation of protein neddylation
1 cytoplasmic translation
13 mitochondrial electron transport, cytochrome c to oxygen
73 positive regulation of miRNA catabolic process
57 proton motive force-driven ATP synthesis
Using the geneIdsByCategory() method, we extract the genes that enrich each GO term and paste it to the result.
geneIDs <- geneIdsByCategory(hgOverCondDysplasia)[resHgOverCondDysplasia$GOBPID]
geneSYMs <- sapply(geneIDs, function(id) select(org.Hs.eg.db, columns="SYMBOL", key=id, keytype="ENTREZID")$SYMBOL)
geneSYMs <- sapply(geneSYMs, paste, collapse=", ")
goresultsDysplasia <- cbind(resHgOverCondDysplasia, Genes=geneSYMs)
rownames(goresultsDysplasia) <- 1:nrow(goresultsDysplasia)
We generate an HTML page using the knitr and kableExtra packages.
See the HTML page by clicking on this link.
Table 2 displays the top 10 Go results (sorted by Odds Ratio).
| Pvalue | OddsRatio | ExpCount | Count | Size | Term | |
|---|---|---|---|---|---|---|
| 1 | 0.0000206 | 27.753982 | 0.7839890 | 6 | 8 | 7-methylguanosine cap hypermethylation |
| 2 | 0.0034803 | 27.683142 | 0.3919945 | 3 | 4 | negative regulation of protein neddylation |
| 3 | 0.0000000 | 21.497819 | 10.5279197 | 74 | 108 | cytoplasmic translation |
| 4 | 0.0000001 | 17.029630 | 1.6659765 | 11 | 17 | mitochondrial electron transport, cytochrome c to oxygen |
| 5 | 0.0080660 | 13.840247 | 0.4899931 | 3 | 5 | positive regulation of miRNA catabolic process |
| 6 | 0.0023570 | 12.556962 | 0.6739526 | 4 | 7 | proton motive force-driven ATP synthesis |
| 7 | 0.0025191 | 12.312132 | 0.6859903 | 4 | 7 | protein maturation by [4Fe-4S] cluster transfer |
| 8 | 0.0000049 | 11.920522 | 1.5679779 | 9 | 16 | ribosomal small subunit assembly |
| 9 | 0.0000860 | 10.798642 | 1.2739821 | 7 | 13 | mitochondrial electron transport, ubiquinol to cytochrome c |
| 10 | 0.0000000 | 9.430127 | 3.4969644 | 18 | 36 | ribosomal small subunit biogenesis |
The second part of the functional analysis involves performing Gene Set Enrichment Analysis (GSEA), a computational method used to determine if predefined sets of genes show statistically significant and concordant differences between two biological states (e.g., phenotypes). In our study, we will conduct GSEA for two comparisons: identifying pathways significantly enriched between current and former smokers and between the presence and absence of premalignant lesions (PMLs).
We load the necessary libraries: fgsea for GSEA, GSVAdata for gene set data, and GSEABase for gene set collection handling. We load the c2BroadSets dataset, which contains gene sets from KEGG, REACTOME, and BIOCARTA pathways.
library(fgsea)
library(GSVAdata)
Loading required package: GSEABase
Loading required package: hgu95a.db
library(GSEABase)
data(c2BroadSets)
c2BroadSets <- c2BroadSets[c(grep("^KEGG", names(c2BroadSets)),
+ grep("^REACTOME", names(c2BroadSets)),
+ grep("^BIOCARTA", names(c2BroadSets)))]
c2BroadSets
GeneSetCollection
names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., BIOCARTA_ACTINY_PATHWAY (833 total)
unique identifiers: 55902, 2645, ..., 8544 (6744 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
In order to do the GSEA anlysis we decided to compare our dataset with two more datasets from the MSigDB. One dataset represents the control group and the other the tumor group. We use the msigdbr library to access the Molecular Signatures Database (MSigDB) for the datasets. The dplyr library is used for data manipulation.
library(msigdbr)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:GSEABase':
intersect, setdiff, union
The following object is masked from 'package:graph':
union
The following object is masked from 'package:AnnotationDbi':
select
The following object is masked from 'package:Biobase':
combine
The following objects are masked from 'package:GenomicRanges':
intersect, setdiff, union
The following object is masked from 'package:GenomeInfoDb':
intersect
The following objects are masked from 'package:IRanges':
collapse, desc, intersect, setdiff, slice, union
The following objects are masked from 'package:S4Vectors':
first, intersect, rename, setdiff, setequal, union
The following objects are masked from 'package:BiocGenerics':
combine, intersect, setdiff, union
The following object is masked from 'package:matrixStats':
count
The following object is masked from 'package:kableExtra':
group_rows
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
msigdb_data <- msigdbr(species = "Homo sapiens")
str(msigdb_data)
tibble [4,331,807 × 15] (S3: tbl_df/tbl/data.frame)
$ gs_cat : chr [1:4331807] "C3" "C3" "C3" "C3" ...
$ gs_subcat : chr [1:4331807] "MIR:MIR_Legacy" "MIR:MIR_Legacy" "MIR:MIR_Legacy" "MIR:MIR_Legacy" ...
$ gs_name : chr [1:4331807] "AAACCAC_MIR140" "AAACCAC_MIR140" "AAACCAC_MIR140" "AAACCAC_MIR140" ...
$ gene_symbol : chr [1:4331807] "ABCC4" "ABRAXAS2" "ACTN4" "ACTN4" ...
$ entrez_gene : int [1:4331807] 10257 23172 81 81 90 8754 8754 11096 219287 287 ...
$ ensembl_gene : chr [1:4331807] "ENSG00000125257" "ENSG00000165660" "ENSG00000130402" "ENSG00000282844" ...
$ human_gene_symbol : chr [1:4331807] "ABCC4" "ABRAXAS2" "ACTN4" "ACTN4" ...
$ human_entrez_gene : int [1:4331807] 10257 23172 81 81 90 8754 8754 11096 219287 287 ...
$ human_ensembl_gene: chr [1:4331807] "ENSG00000125257" "ENSG00000165660" "ENSG00000130402" "ENSG00000282844" ...
$ gs_id : chr [1:4331807] "M12609" "M12609" "M12609" "M12609" ...
$ gs_pmid : chr [1:4331807] "" "" "" "" ...
$ gs_geoid : chr [1:4331807] "" "" "" "" ...
$ gs_exact_source : chr [1:4331807] "" "" "" "" ...
$ gs_url : chr [1:4331807] "" "" "" "" ...
$ gs_description : chr [1:4331807] "Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents "| __truncated__ "Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents "| __truncated__ "Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents "| __truncated__ "Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents "| __truncated__ ...
control <- msigdb_data %>% filter(gs_name == "DESCARTES_FETAL_LUNG_BRONCHIOLAR_AND_ALVEOLAR_EPITHELIAL_CELLS")
tumor <- msigdb_data %>% filter(gs_name == "DING_LUNG_CANCER_MUTATED_FREQUENTLY")
We create a custom gene set using the Entrez IDs of differentially expressed genes in smoking status. We then combine this custom gene set with the previously filtered gene sets into a single Gene Set Collection (gsc).
library(GSEABase)
control_genes <- as.character(unique(control$entrez_gene))
control_GS <- GeneSet(EntrezIdentifier("org.Hs.eg.db"), geneIds=control_genes, setName="CONTROL")
details(control_GS)
setName: CONTROL
geneIds: 55289, 114, ..., 7364 (total: 70)
geneIdType: EntrezId (org.Hs.eg.db)
collectionType: Null
setIdentifier: Emmas-Laptop.local:51879:Tue Aug 20 19:35:10 2024:1
description:
organism:
pubMedIds:
urls:
contributor:
setVersion: 0.0.1
creationDate:
tumor_genes <- as.character(unique(tumor$entrez_gene))
tumor_GS <- GeneSet(EntrezIdentifier("org.Hs.eg.db"), geneIds=tumor_genes, setName="TUMOR")
details(tumor_GS)
setName: TUMOR
geneIds: 91, 324, ..., 7157 (total: 12)
geneIdType: EntrezId (org.Hs.eg.db)
collectionType: Null
setIdentifier: Emmas-Laptop.local:51879:Tue Aug 20 19:35:10 2024:2
description:
organism:
pubMedIds:
urls:
contributor:
setVersion: 0.0.1
creationDate:
gsc_smoker <- GeneSetCollection(c(c2BroadSets, control_GS, tumor_GS))
gsets_smoker <- geneIds(gsc_smoker)
We prepare the data by converting the histology information into factors and then into integers.
histology_factors_smoker <- factor(se.filt$smoker)
histology_integers_smoker <- as.integer(histology_factors_smoker)
levels(histology_factors_smoker)
[1] "Current smoker" "Former smoker"
histology_integers_smoker
[1] 2 1 1 1 2 1 1 1 2 1 2 2 2 1 1 2 1 1 2 2 2 2 1 1 2 2 2 2 2 2 1 1 2 2 1 2 1 1
[39] 2 2 2 1 2 1 1 1 2 2 2 2 2 1 1 2 1 1 1 2 1 1 2 1 1 2 1 2 2 1 1 2 1 1 2 2
We use the fgseaLabel function to perform GSEA by permuting phenotypes 10,000 times to compute enrichment scores and p-values for each gene set.
gseapermpheno_smoker <- fgseaLabel(gsets_smoker, assays(se.filt)$logCPM, histology_integers_smoker,10000, minSize=5, maxSize=300)
head(gseapermpheno_smoker[order(gseapermpheno_smoker$pval), ])
pathway
<char>
1: KEGG_VIBRIO_CHOLERAE_INFECTION
2: KEGG_PROTEIN_EXPORT
3: KEGG_PENTOSE_PHOSPHATE_PATHWAY
4: REACTOME_NEF_MEDIATES_DOWN_MODULATION_OF_CELL_SURFACE_RECEPTORS_BY_RECRUITING_THEM_TO_CLATHRIN_ADAPTERS
5: KEGG_GLUTATHIONE_METABOLISM
6: REACTOME_HOMOLOGOUS_RECOMBINATION_REPAIR
pval padj ES NES nMoreExtreme size leadingEdge
<num> <num> <num> <num> <int> <int> <list>
1: 0.006082009 0.998112 -0.5863669 -1.713281 30 47 10945, 5....
2: 0.006632852 0.998112 -0.7624549 -1.687242 33 23 28972, 9....
3: 0.008277493 0.998112 -0.6475517 -1.711022 41 21 6888, 70....
4: 0.013095238 0.998112 -0.6365695 -1.687739 65 18 1174, 51....
5: 0.013299433 0.998112 -0.6190362 -1.651290 67 40 2936, 28....
6: 0.019032822 0.998112 0.5912820 1.600837 97 14 3978, 96....
We prepare a ranked list of genes based on their differential expression statistics (tt_smoker$t). We then use the fgsea function to perform GSEA on this ranked list.
stats_smoker <- tt_smoker$t
names(stats_smoker) <- rownames(tt_smoker)
gseapermgsets_smoker <- fgsea(gsets_smoker, stats_smoker, minSize=5, maxSize=300)
head(gseapermgsets_smoker[order(gseapermgsets_smoker$pval), ])
pathway
<char>
1: REACTOME_INSULIN_SYNTHESIS_AND_SECRETION
2: KEGG_OXIDATIVE_PHOSPHORYLATION
3: REACTOME_TRANSLATION
4: REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT
5: REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION
6: KEGG_PARKINSONS_DISEASE
pval padj log2err ES NES size leadingEdge
<num> <num> <num> <num> <num> <int> <list>
1: 1.614771e-17 1.312809e-14 1.076868 -0.6042764 -2.832105 114 28972, 9....
2: 3.631782e-17 1.476319e-14 1.067210 -0.6002303 -2.813142 114 51606, 9....
3: 3.286682e-16 8.906909e-14 1.037696 -0.5959089 -2.785690 112 8667, 86....
4: 9.326339e-16 1.895578e-13 1.017545 -0.6093776 -2.811094 99 8667, 86....
5: 2.770603e-15 4.505000e-13 1.007318 -0.5430190 -2.627287 140 6513, 22....
6: 3.550234e-15 4.810566e-13 1.007318 -0.5804702 -2.710117 111 7345, 93....
gseapermpheno_smoker[gseapermpheno_smoker$padj < 0.01, ]
Empty data.table (0 rows and 8 cols): pathway,pval,padj,ES,NES,nMoreExtreme...
gseapermgsets_smoker[gseapermgsets_smoker$padj < 0.01, ]
pathway
<char>
1: BIOCARTA_IL7_PATHWAY
2: BIOCARTA_PROTEASOME_PATHWAY
3: BIOCARTA_SALMONELLA_PATHWAY
4: KEGG_ALZHEIMERS_DISEASE
5: KEGG_AMINO_SUGAR_AND_NUCLEOTIDE_SUGAR_METABOLISM
6: KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION
7: KEGG_GLUTATHIONE_METABOLISM
8: KEGG_HUNTINGTONS_DISEASE
9: KEGG_JAK_STAT_SIGNALING_PATHWAY
10: KEGG_LYSOSOME
11: KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450
12: KEGG_N_GLYCAN_BIOSYNTHESIS
13: KEGG_OXIDATIVE_PHOSPHORYLATION
14: KEGG_PARKINSONS_DISEASE
15: KEGG_PENTOSE_PHOSPHATE_PATHWAY
16: KEGG_PROTEASOME
17: KEGG_PROTEIN_EXPORT
18: KEGG_RIBOSOME
19: KEGG_SNARE_INTERACTIONS_IN_VESICULAR_TRANSPORT
20: KEGG_VIBRIO_CHOLERAE_INFECTION
21: REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC
22: REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX
23: REACTOME_CLATHRIN_DERIVED_VESICLE_BUDDING
24: REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_
25: REACTOME_ELECTRON_TRANSPORT_CHAIN
26: REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING
27: REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS
28: REACTOME_FORMATION_OF_THE_EARLY_ELONGATION_COMPLEX
29: REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX
30: REACTOME_GENERATION_OF_SECOND_MESSENGER_MOLECULES
31: REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION
32: REACTOME_GLUTATHIONE_CONJUGATION
33: REACTOME_GOLGI_ASSOCIATED_VESICLE_BIOGENESIS
34: REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT
35: REACTOME_HIV1_TRANSCRIPTION_ELONGATION
36: REACTOME_HIV_INFECTION
37: REACTOME_INFLUENZA_LIFE_CYCLE
38: REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION
39: REACTOME_INSULIN_SYNTHESIS_AND_SECRETION
40: REACTOME_INTEGRATION_OF_ENERGY_METABOLISM
41: REACTOME_MEMBRANE_TRAFFICKING
42: REACTOME_METABOLISM_OF_AMINO_ACIDS
43: REACTOME_METABOLISM_OF_PROTEINS
44: REACTOME_M_G1_TRANSITION
45: REACTOME_NEF_MEDIATES_DOWN_MODULATION_OF_CELL_SURFACE_RECEPTORS_BY_RECRUITING_THEM_TO_CLATHRIN_ADAPTERS
46: REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE
47: REACTOME_PEPTIDE_CHAIN_ELONGATION
48: REACTOME_PHASE_II_CONJUGATION
49: REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE
50: REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT
51: REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS
52: REACTOME_REGULATION_OF_INSULIN_SECRETION
53: REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE
54: REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1
55: REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21
56: REACTOME_SIGNALING_BY_WNT
57: REACTOME_STABILIZATION_OF_P53
58: REACTOME_THE_ROLE_OF_NEF_IN_HIV1_REPLICATION_AND_DISEASE_PATHOGENESIS
59: REACTOME_TRANSCRIPTION_OF_THE_HIV_GENOME
60: REACTOME_TRANSLATION
61: REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION
62: REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G
63: REACTOME_VIRAL_MRNA_TRANSLATION
pathway
pval padj log2err ES NES size leadingEdge
<num> <num> <num> <num> <num> <int> <list>
1: 4.299426e-04 6.020500e-03 0.4984931 0.6724408 2.072147 16 3574, 20....
2: 5.063540e-04 6.639771e-03 0.4772708 -0.6447528 -2.037740 19 6185, 56....
3: 4.507973e-04 6.108303e-03 0.4984931 -0.7759773 -2.012846 10 10094, 1....
4: 2.468758e-11 1.338067e-09 0.8634154 -0.4962823 -2.396818 139 2597, 11....
5: 4.458925e-05 8.631205e-04 0.5573322 -0.5483911 -2.063953 38 2582, 27....
6: 3.365383e-04 5.261647e-03 0.4984931 0.3679708 1.800944 106 3566, 63....
7: 1.648727e-07 5.585063e-06 0.6901325 -0.6339983 -2.409353 40 2877, 29....
8: 1.193424e-10 5.707375e-09 0.8266573 -0.4660142 -2.277193 157 148327, ....
9: 4.051522e-06 9.687904e-05 0.6105269 0.4467455 2.085570 88 3566, 85....
10: 4.303598e-06 9.996643e-05 0.6105269 -0.4339797 -2.028721 112 3423, 47....
11: 1.495416e-07 5.526242e-06 0.6901325 -0.6461607 -2.401876 36 218, 164....
12: 3.573577e-04 5.319976e-03 0.4984931 -0.4985627 -1.942198 44 8704, 19....
13: 3.631782e-17 1.476319e-14 1.0672100 -0.6002303 -2.813142 114 51606, 9....
14: 3.550234e-15 4.810566e-13 1.0073180 -0.5804702 -2.710117 111 7345, 93....
15: 1.522145e-04 2.690226e-03 0.5188481 -0.6503176 -2.107616 21 6888, 70....
16: 1.677839e-06 4.400267e-05 0.6435518 -0.6010822 -2.298384 41 5717, 57....
17: 3.095880e-08 1.324711e-06 0.7195128 -0.7634091 -2.534039 23 28972, 9....
18: 6.290674e-15 7.306169e-13 0.9969862 -0.6367641 -2.783399 80 28998, 5....
19: 4.036133e-04 5.859600e-03 0.4984931 -0.5238744 -1.947320 36 10652, 6....
20: 5.677129e-07 1.709447e-05 0.6594444 -0.5811042 -2.294809 47 10945, 1....
21: 3.301747e-05 6.710800e-04 0.5573322 -0.5041317 -2.065682 56 5717, 73....
22: 3.553942e-04 5.319976e-03 0.4984931 -0.4775321 -1.914708 50 5717, 73....
23: 3.686913e-05 7.310879e-04 0.5573322 -0.4916843 -2.056021 59 2512, 24....
24: 5.558169e-05 1.050882e-03 0.5573322 -0.4952473 -2.031736 55 5717, 65....
25: 2.798918e-13 1.896267e-11 0.9325952 -0.6329125 -2.720252 71 9377, 47....
26: 3.169378e-04 5.153408e-03 0.4984931 -0.6999149 -2.040987 15 515, 514....
27: 3.497068e-14 2.584651e-12 0.9653278 -0.6129844 -2.750971 88 8667, 86....
28: 4.108636e-04 5.860213e-03 0.4984931 -0.5623485 -1.996062 29 22916, 5....
29: 3.201969e-06 7.888486e-05 0.6272567 -0.5825199 -2.269262 44 8667, 86....
30: 1.122466e-04 2.074011e-03 0.5384341 0.6091808 2.161551 26 5058, 51....
31: 2.770603e-15 4.505000e-13 1.0073180 -0.5430190 -2.627287 140 6513, 22....
32: 3.264406e-04 5.203847e-03 0.4984931 -0.7159891 -2.062425 14 4257, 27....
33: 1.713046e-04 2.963206e-03 0.5188481 -0.4930281 -1.992204 52 2512, 24....
34: 9.326339e-16 1.895578e-13 1.0175448 -0.6093776 -2.811094 99 8667, 86....
35: 1.506294e-04 2.690226e-03 0.5188481 -0.5226089 -1.966918 38 22916, 5....
36: 5.597472e-06 1.264096e-04 0.6105269 -0.3708108 -1.841777 176 1174, 51....
37: 6.691038e-06 1.431530e-04 0.6105269 -0.4085555 -1.961145 129 5436, 56....
38: 7.265374e-15 7.383437e-13 0.9865463 -0.6145290 -2.773292 92 5436, 56....
39: 1.614771e-17 1.312809e-14 1.0768682 -0.6042764 -2.832105 114 28972, 9....
40: 3.676574e-13 2.299273e-11 0.9325952 -0.4680134 -2.356694 191 6888, 70....
41: 1.569938e-07 5.549391e-06 0.6901325 -0.5164652 -2.243162 76 2512, 22....
42: 5.962737e-08 2.423853e-06 0.7049757 -0.4542933 -2.168565 125 1728, 23....
43: 3.408285e-11 1.731835e-09 0.8513391 -0.4514737 -2.276909 194 8667, 54....
44: 4.369121e-04 6.020500e-03 0.4984931 -0.4562860 -1.889273 57 5717, 73....
45: 4.648411e-04 6.195341e-03 0.4984931 -0.6477526 -2.016798 18 1174, 51....
46: 6.007326e-07 1.744270e-05 0.6594444 -0.5975739 -2.303853 42 5717, 73....
47: 2.387782e-14 1.941267e-12 0.9759947 -0.6354136 -2.773331 78 6134, 90....
48: 2.129517e-04 3.533259e-03 0.5188481 -0.5850268 -2.086252 30 4257, 27....
49: 3.599000e-04 5.319976e-03 0.4984931 -0.4277824 -1.835476 69 5717, 65....
50: 7.109836e-08 2.752522e-06 0.7049757 -0.5069179 -2.260858 87 3171, 61....
51: 2.970462e-08 1.324711e-06 0.7337620 -0.5272849 -2.319129 83 3171, 61....
52: 5.999577e-12 3.484040e-10 0.8870750 -0.4812404 -2.384412 171 6513, 22....
53: 2.548378e-07 8.287327e-06 0.6749629 -0.6033810 -2.354751 46 1728, 57....
54: 2.199731e-06 5.588691e-05 0.6272567 -0.5726409 -2.260105 48 5717, 65....
55: 2.075988e-05 4.327636e-04 0.5756103 -0.5266596 -2.111689 50 5717, 65....
56: 6.109282e-06 1.342391e-04 0.6105269 -0.5196133 -2.156391 58 5528, 57....
57: 7.424768e-07 2.081495e-05 0.6594444 -0.5914674 -2.308257 46 5717, 73....
58: 1.821620e-04 3.085369e-03 0.5188481 -0.6221308 -2.091324 24 1174, 51....
59: 6.603218e-04 8.521296e-03 0.4772708 -0.4505919 -1.846302 56 6884, 68....
60: 3.286682e-16 8.906909e-14 1.0376962 -0.5959089 -2.785690 112 8667, 86....
61: 3.872432e-07 1.210880e-05 0.6749629 -0.5771814 -2.322203 51 8667, 86....
62: 8.305533e-07 2.250800e-05 0.6594444 -0.5760275 -2.274761 47 5717, 73....
63: 2.348315e-14 1.941267e-12 0.9759947 -0.6356636 -2.774422 78 5611, 61....
pval padj log2err ES NES size leadingEdge
significant_pathways_gsets_smoker <- gseapermgsets_smoker[gseapermgsets_smoker$padj < 0.01, ]
significant_pathways_gsets_ordered_smoker <- significant_pathways_gsets_smoker[order(significant_pathways_gsets_smoker$padj), ]
head(significant_pathways_gsets_ordered_smoker, n = 10)
pathway
<char>
1: REACTOME_INSULIN_SYNTHESIS_AND_SECRETION
2: KEGG_OXIDATIVE_PHOSPHORYLATION
3: REACTOME_TRANSLATION
4: REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT
5: REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION
6: KEGG_PARKINSONS_DISEASE
7: KEGG_RIBOSOME
8: REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION
9: REACTOME_PEPTIDE_CHAIN_ELONGATION
10: REACTOME_VIRAL_MRNA_TRANSLATION
pval padj log2err ES NES size leadingEdge
<num> <num> <num> <num> <num> <int> <list>
1: 1.614771e-17 1.312809e-14 1.0768682 -0.6042764 -2.832105 114 28972, 9....
2: 3.631782e-17 1.476319e-14 1.0672100 -0.6002303 -2.813142 114 51606, 9....
3: 3.286682e-16 8.906909e-14 1.0376962 -0.5959089 -2.785690 112 8667, 86....
4: 9.326339e-16 1.895578e-13 1.0175448 -0.6093776 -2.811094 99 8667, 86....
5: 2.770603e-15 4.505000e-13 1.0073180 -0.5430190 -2.627287 140 6513, 22....
6: 3.550234e-15 4.810566e-13 1.0073180 -0.5804702 -2.710117 111 7345, 93....
7: 6.290674e-15 7.306169e-13 0.9969862 -0.6367641 -2.783399 80 28998, 5....
8: 7.265374e-15 7.383437e-13 0.9865463 -0.6145290 -2.773292 92 5436, 56....
9: 2.387782e-14 1.941267e-12 0.9759947 -0.6354136 -2.773331 78 6134, 90....
10: 2.348315e-14 1.941267e-12 0.9759947 -0.6356636 -2.774422 78 5611, 61....
We create histograms of p-values from both GSEA methods (permuting phenotypes and permuting gene sets) to visualize the distribution of p-values. We also generate an enrichment plot for the custom gene set DE_FC_SMOKER to show the enrichment score across the ranked gene list.
par(mfrow=c(1, 2))
hist(gseapermpheno_smoker$pval, xlab="P-value", main="Permuting phenotypes", las=1)
hist(gseapermgsets_smoker$pval, xlab="P-value", main="Permuting gene sets", las=1)
plotEnrichment(gsets_smoker$CONTROL, stats_smoker)
plotEnrichment(gsets_smoker$TUMOR, stats_smoker)
We use this part of the code to identify pathways that are significantly enriched (adjusted p-value < 0.01). The plotGseaTable function then generates a table and plot of GSEA results for these significant pathways, showing the enrichment scores and p-values. This helps us visualize and interpret the most relevant pathways in our smoking status analysis.
DEpwys <- gseapermgsets_smoker$pathway[gseapermgsets_smoker$padj < 0.01]
plotGseaTable(gsets_smoker[DEpwys], stats_smoker, gseapermgsets_smoker, gseaParam=0.5)
We generate an HTML page so we can visualize more easily.
library(kableExtra)
gsea_tab_smoker <- kable(significant_pathways_gsets_ordered_smoker, "html", caption="GSEA Results Smoker")
gsea_tab_smoker <- kable_styling(gsea_tab_smoker, bootstrap_options=c("stripped", "hover", "responsive"), fixed_thead=TRUE)
save_kable(gsea_tab_smoker, file="./results/gsea_results_smoker.html", self_contained=TRUE)
See the HTML page by clicking on this link.
We use the limma::camera function to perform an additional gene set analysis that adjusts for inter-gene correlation within gene sets. This step enhances the robustness of our findings by providing an alternative method to validate the significant pathways identified by GSEA. The camera function uses the same gene sets and design matrix as our previous analyses.
gsetidx_smoker <- ids2indices(geneIds(gsc_smoker), rownames(se.filt))
camerares_smoker <- camera(y=assays(se.filt)$logCPM, index=gsetidx_smoker, design=mod, contrast=2)
head(camerares_smoker, 5)
NGenes Direction
KEGG_RIBOSOME 80 Down
REACTOME_PEPTIDE_CHAIN_ELONGATION 78 Down
REACTOME_VIRAL_MRNA_TRANSLATION 78 Down
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 88 Down
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 92 Down
PValue
KEGG_RIBOSOME 2.850098e-41
REACTOME_PEPTIDE_CHAIN_ELONGATION 3.820858e-41
REACTOME_VIRAL_MRNA_TRANSLATION 1.309456e-40
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 1.886464e-39
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 2.731443e-38
FDR
KEGG_RIBOSOME 1.591387e-38
REACTOME_PEPTIDE_CHAIN_ELONGATION 1.591387e-38
REACTOME_VIRAL_MRNA_TRANSLATION 3.635921e-38
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 3.928561e-37
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 4.550585e-36
We filter the Camera results to focus on pathways with an FDR < 0.01, ensuring we only consider the most significant pathways. This step helps us further refine our list of relevant pathways.
camerares_smoker[camerares_smoker$FDR < 0.01, ]
NGenes
KEGG_RIBOSOME 80
REACTOME_PEPTIDE_CHAIN_ELONGATION 78
REACTOME_VIRAL_MRNA_TRANSLATION 78
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 88
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 92
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS 83
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT 99
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION 114
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT 87
REACTOME_TRANSLATION 112
REACTOME_INFLUENZA_LIFE_CYCLE 129
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX 44
REACTOME_DIABETES_PATHWAYS 312
REACTOME_METABOLISM_OF_PROTEINS 194
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION 51
KEGG_PARKINSONS_DISEASE 111
KEGG_OXIDATIVE_PHOSPHORYLATION 114
REACTOME_ELECTRON_TRANSPORT_CHAIN 71
KEGG_ALZHEIMERS_DISEASE 139
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION 140
KEGG_HUNTINGTONS_DISEASE 157
REACTOME_REGULATION_OF_INSULIN_SECRETION 171
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G 47
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING 15
KEGG_PROTEASOME 41
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE 42
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE 46
REACTOME_STABILIZATION_OF_P53 46
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_ 55
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1 48
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM 191
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 50
REACTOME_SIGNALING_BY_WNT 58
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC 21
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A 61
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX 50
REACTOME_GENE_EXPRESSION 397
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC 56
REACTOME_M_G1_TRANSITION 57
REACTOME_G1_S_TRANSITION 88
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE 69
KEGG_PROTEIN_EXPORT 23
BIOCARTA_ETC_PATHWAY 12
REACTOME_DNA_REPLICATION_PRE_INITIATION 68
BIOCARTA_RAB_PATHWAY 10
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN 61
REACTOME_METABOLISM_OF_AMINO_ACIDS 125
REACTOME_CELL_CYCLE_CHECKPOINTS 97
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC 15
REACTOME_S_PHASE 93
REACTOME_HIV_INFECTION 176
REACTOME_SYNTHESIS_OF_DNA 82
BIOCARTA_PROTEASOME_PATHWAY 19
KEGG_CARDIAC_MUSCLE_CONTRACTION 46
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS 116
REACTOME_MRNA_SPLICING_MINOR_PATHWAY 40
BIOCARTA_SALMONELLA_PATHWAY 10
BIOCARTA_CDC42RAC_PATHWAY 14
REACTOME_CHOLESTEROL_BIOSYNTHESIS 19
REACTOME_MITOTIC_M_M_G1_PHASES 137
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS 12
REACTOME_HIV1_TRANSCRIPTION_INITIATION 36
KEGG_CITRATE_CYCLE_TCA_CYCLE 27
Direction
KEGG_RIBOSOME Down
REACTOME_PEPTIDE_CHAIN_ELONGATION Down
REACTOME_VIRAL_MRNA_TRANSLATION Down
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS Down
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION Down
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS Down
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT Down
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION Down
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT Down
REACTOME_TRANSLATION Down
REACTOME_INFLUENZA_LIFE_CYCLE Down
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX Down
REACTOME_DIABETES_PATHWAYS Down
REACTOME_METABOLISM_OF_PROTEINS Down
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION Down
KEGG_PARKINSONS_DISEASE Down
KEGG_OXIDATIVE_PHOSPHORYLATION Down
REACTOME_ELECTRON_TRANSPORT_CHAIN Down
KEGG_ALZHEIMERS_DISEASE Down
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION Down
KEGG_HUNTINGTONS_DISEASE Down
REACTOME_REGULATION_OF_INSULIN_SECRETION Down
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G Down
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING Down
KEGG_PROTEASOME Down
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE Down
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE Down
REACTOME_STABILIZATION_OF_P53 Down
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_ Down
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1 Down
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM Down
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 Down
REACTOME_SIGNALING_BY_WNT Down
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC Down
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A Down
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX Down
REACTOME_GENE_EXPRESSION Down
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC Down
REACTOME_M_G1_TRANSITION Down
REACTOME_G1_S_TRANSITION Down
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE Down
KEGG_PROTEIN_EXPORT Down
BIOCARTA_ETC_PATHWAY Down
REACTOME_DNA_REPLICATION_PRE_INITIATION Down
BIOCARTA_RAB_PATHWAY Down
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN Down
REACTOME_METABOLISM_OF_AMINO_ACIDS Down
REACTOME_CELL_CYCLE_CHECKPOINTS Down
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC Down
REACTOME_S_PHASE Down
REACTOME_HIV_INFECTION Down
REACTOME_SYNTHESIS_OF_DNA Down
BIOCARTA_PROTEASOME_PATHWAY Down
KEGG_CARDIAC_MUSCLE_CONTRACTION Down
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS Down
REACTOME_MRNA_SPLICING_MINOR_PATHWAY Down
BIOCARTA_SALMONELLA_PATHWAY Down
BIOCARTA_CDC42RAC_PATHWAY Down
REACTOME_CHOLESTEROL_BIOSYNTHESIS Down
REACTOME_MITOTIC_M_M_G1_PHASES Down
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS Down
REACTOME_HIV1_TRANSCRIPTION_INITIATION Down
KEGG_CITRATE_CYCLE_TCA_CYCLE Down
PValue
KEGG_RIBOSOME 2.850098e-41
REACTOME_PEPTIDE_CHAIN_ELONGATION 3.820858e-41
REACTOME_VIRAL_MRNA_TRANSLATION 1.309456e-40
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 1.886464e-39
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 2.731443e-38
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS 1.145772e-35
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT 1.225223e-35
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION 1.019429e-33
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT 2.283164e-33
REACTOME_TRANSLATION 1.171285e-32
REACTOME_INFLUENZA_LIFE_CYCLE 3.033430e-22
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX 5.392902e-22
REACTOME_DIABETES_PATHWAYS 8.692926e-22
REACTOME_METABOLISM_OF_PROTEINS 5.349591e-20
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION 9.395454e-20
KEGG_PARKINSONS_DISEASE 1.839192e-15
KEGG_OXIDATIVE_PHOSPHORYLATION 3.491358e-15
REACTOME_ELECTRON_TRANSPORT_CHAIN 5.005995e-15
KEGG_ALZHEIMERS_DISEASE 3.306168e-12
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION 4.568293e-11
KEGG_HUNTINGTONS_DISEASE 3.248647e-10
REACTOME_REGULATION_OF_INSULIN_SECRETION 1.611900e-09
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G 2.700881e-09
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING 1.052427e-08
KEGG_PROTEASOME 1.058222e-08
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE 3.144120e-08
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE 3.740179e-08
REACTOME_STABILIZATION_OF_P53 6.522039e-08
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_ 7.748487e-08
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1 8.120236e-08
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM 9.625664e-08
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 1.853028e-07
REACTOME_SIGNALING_BY_WNT 1.932560e-07
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC 3.724800e-07
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A 4.225662e-07
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX 5.652607e-07
REACTOME_GENE_EXPRESSION 6.425514e-07
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC 7.186981e-07
REACTOME_M_G1_TRANSITION 1.059953e-06
REACTOME_G1_S_TRANSITION 1.682678e-06
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE 1.857432e-06
KEGG_PROTEIN_EXPORT 2.806326e-06
BIOCARTA_ETC_PATHWAY 3.925179e-06
REACTOME_DNA_REPLICATION_PRE_INITIATION 4.241129e-06
BIOCARTA_RAB_PATHWAY 7.089697e-06
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN 7.910600e-06
REACTOME_METABOLISM_OF_AMINO_ACIDS 1.757114e-05
REACTOME_CELL_CYCLE_CHECKPOINTS 1.821139e-05
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC 2.301368e-05
REACTOME_S_PHASE 2.778510e-05
REACTOME_HIV_INFECTION 3.533418e-05
REACTOME_SYNTHESIS_OF_DNA 5.736754e-05
BIOCARTA_PROTEASOME_PATHWAY 1.466427e-04
KEGG_CARDIAC_MUSCLE_CONTRACTION 1.928235e-04
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS 2.272538e-04
REACTOME_MRNA_SPLICING_MINOR_PATHWAY 2.721551e-04
BIOCARTA_SALMONELLA_PATHWAY 2.757882e-04
BIOCARTA_CDC42RAC_PATHWAY 3.143116e-04
REACTOME_CHOLESTEROL_BIOSYNTHESIS 4.647822e-04
REACTOME_MITOTIC_M_M_G1_PHASES 4.659876e-04
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS 4.930346e-04
REACTOME_HIV1_TRANSCRIPTION_INITIATION 4.994264e-04
KEGG_CITRATE_CYCLE_TCA_CYCLE 5.506234e-04
FDR
KEGG_RIBOSOME 1.591387e-38
REACTOME_PEPTIDE_CHAIN_ELONGATION 1.591387e-38
REACTOME_VIRAL_MRNA_TRANSLATION 3.635921e-38
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 3.928561e-37
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 4.550585e-36
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS 1.458015e-33
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT 1.458015e-33
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION 1.061480e-31
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT 2.113196e-31
REACTOME_TRANSLATION 9.756803e-31
REACTOME_INFLUENZA_LIFE_CYCLE 2.297134e-20
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX 3.743573e-20
REACTOME_DIABETES_PATHWAYS 5.570160e-20
REACTOME_METABOLISM_OF_PROTEINS 3.183007e-18
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION 5.217609e-18
KEGG_PARKINSONS_DISEASE 9.575296e-14
KEGG_OXIDATIVE_PHOSPHORYLATION 1.710766e-13
REACTOME_ELECTRON_TRANSPORT_CHAIN 2.316663e-13
KEGG_ALZHEIMERS_DISEASE 1.449494e-10
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION 1.902694e-09
KEGG_HUNTINGTONS_DISEASE 1.288630e-08
REACTOME_REGULATION_OF_INSULIN_SECRETION 6.103241e-08
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G 9.781886e-08
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING 3.525994e-07
KEGG_PROTEASOME 3.525994e-07
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE 1.007328e-06
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE 1.153915e-06
REACTOME_STABILIZATION_OF_P53 1.940307e-06
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_ 2.225686e-06
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1 2.254719e-06
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM 2.586509e-06
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 4.823664e-06
REACTOME_SIGNALING_BY_WNT 4.878251e-06
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC 9.125760e-06
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A 1.005707e-05
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX 1.307950e-05
REACTOME_GENE_EXPRESSION 1.446609e-05
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC 1.575462e-05
REACTOME_M_G1_TRANSITION 2.263951e-05
REACTOME_G1_S_TRANSITION 3.504176e-05
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE 3.773759e-05
KEGG_PROTEIN_EXPORT 5.565880e-05
BIOCARTA_ETC_PATHWAY 7.603893e-05
REACTOME_DNA_REPLICATION_PRE_INITIATION 8.029228e-05
BIOCARTA_RAB_PATHWAY 1.312382e-04
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN 1.432507e-04
REACTOME_METABOLISM_OF_AMINO_ACIDS 3.114204e-04
REACTOME_CELL_CYCLE_CHECKPOINTS 3.160435e-04
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC 3.912325e-04
REACTOME_S_PHASE 4.628998e-04
REACTOME_HIV_INFECTION 5.771249e-04
REACTOME_SYNTHESIS_OF_DNA 9.189839e-04
BIOCARTA_PROTEASOME_PATHWAY 2.304781e-03
KEGG_CARDIAC_MUSCLE_CONTRACTION 2.974481e-03
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS 3.441862e-03
REACTOME_MRNA_SPLICING_MINOR_PATHWAY 4.030378e-03
BIOCARTA_SALMONELLA_PATHWAY 4.030378e-03
BIOCARTA_CDC42RAC_PATHWAY 4.514165e-03
REACTOME_CHOLESTEROL_BIOSYNTHESIS 6.469461e-03
REACTOME_MITOTIC_M_M_G1_PHASES 6.469461e-03
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS 6.710035e-03
REACTOME_HIV1_TRANSCRIPTION_INITIATION 6.710035e-03
KEGG_CITRATE_CYCLE_TCA_CYCLE 7.280464e-03
We create a histogram of the p-values from the Camera results to visualize the distribution of significance levels across all tested pathways. This plot provides an overview of the significance of the pathways in relation to dysplasia.
hist(camerares_smoker$PValue, xlab="P-value", main="Camera", las=1)
We do the same steps as previously but now for the Dysplasia. Load the required libraries and data.
library(fgsea)
library(GSVAdata)
library(GSEABase)
data(c2BroadSets)
c2BroadSets <- c2BroadSets[c(grep("^KEGG", names(c2BroadSets)),
+ grep("^REACTOME", names(c2BroadSets)),
+ grep("^BIOCARTA", names(c2BroadSets)))]
c2BroadSets
GeneSetCollection
names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., BIOCARTA_ACTINY_PATHWAY (833 total)
unique identifiers: 55902, 2645, ..., 8544 (6744 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
Create a costum gene set for Dysplasia.
library(GSEABase)
control_genes <- as.character(unique(control$entrez_gene))
control_GS <- GeneSet(EntrezIdentifier("org.Hs.eg.db"), geneIds=control_genes, setName="CONTROL")
details(control_GS)
setName: CONTROL
geneIds: 55289, 114, ..., 7364 (total: 70)
geneIdType: EntrezId (org.Hs.eg.db)
collectionType: Null
setIdentifier: Emmas-Laptop.local:51879:Tue Aug 20 19:35:33 2024:3
description:
organism:
pubMedIds:
urls:
contributor:
setVersion: 0.0.1
creationDate:
tumor_genes <- as.character(unique(tumor$entrez_gene))
tumor_GS <- GeneSet(EntrezIdentifier("org.Hs.eg.db"), geneIds=tumor_genes, setName="TUMOR")
details(tumor_GS)
setName: TUMOR
geneIds: 91, 324, ..., 7157 (total: 12)
geneIdType: EntrezId (org.Hs.eg.db)
collectionType: Null
setIdentifier: Emmas-Laptop.local:51879:Tue Aug 20 19:35:33 2024:4
description:
organism:
pubMedIds:
urls:
contributor:
setVersion: 0.0.1
creationDate:
gsc_dysplasia <- GeneSetCollection(c(c2BroadSets, control_GS, tumor_GS))
gsets_dysplasia<- geneIds(gsc_dysplasia)
Prepare the data.
histology_factors_dysplasia <- factor(se.filt$dysplasia)
histology_integers_dysplasia <- as.integer(histology_factors_dysplasia)
levels(histology_factors_dysplasia)
[1] "dysplasia" "no dysplasia"
histology_integers_dysplasia
[1] 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 2 2 2 2 1 1 2 2 2 2 1 1 1 1 2 1 1 2 2 2 1
[39] 1 1 1 2 2 1 1 2 2 1 2 1 1 1 2 2 1 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1
Run the GSEA by permuting phenotypes.
gseapermpheno_dysplasia <- fgseaLabel(gsets_dysplasia, assays(se.filt)$logCPM, histology_integers_dysplasia,10000, minSize=5, maxSize=300)
head(gseapermpheno_dysplasia[order(gseapermpheno_dysplasia$pval)])
pathway
<char>
1: REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING
2: KEGG_PARKINSONS_DISEASE
3: KEGG_ALZHEIMERS_DISEASE
4: REACTOME_REGULATION_OF_INSULIN_SECRETION
5: KEGG_HUNTINGTONS_DISEASE
6: REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX
pval padj ES NES nMoreExtreme size leadingEdge
<num> <num> <num> <num> <int> <int> <list>
1: 0.0001966955 0.1599135 -0.9322777 -1.778352 0 15 506, 106....
2: 0.0005896226 0.1607991 -0.7284616 -1.862856 2 111 506, 471....
3: 0.0007889546 0.1607991 -0.6349335 -1.910018 3 139 506, 471....
4: 0.0007911392 0.1607991 -0.5998803 -1.829791 3 171 506, 471....
5: 0.0013853157 0.1766344 -0.6018486 -1.879833 6 157 506, 471....
6: 0.0015822785 0.1766344 -0.8644775 -1.788427 7 44 6217, 21....
Run GSEA with the ranked gene list.
stats_dysplasia <- tt_dysplasia$t
names(stats_dysplasia) <- rownames(tt_dysplasia)
gseapermgsets_dysplasia <- fgsea(gsets_dysplasia, stats_dysplasia, minSize=5, maxSize=300)
Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
minSize, : For some pathways, in reality P-values are less than 1e-50. You can
set the `eps` argument to zero for better estimation.
head(gseapermgsets_dysplasia[order(gseapermgsets_dysplasia$pval), ])
pathway pval
<char> <num>
1: KEGG_RIBOSOME 1e-50
2: REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 1e-50
3: REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT 1e-50
4: REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 1e-50
5: REACTOME_INSULIN_SYNTHESIS_AND_SECRETION 1e-50
6: REACTOME_PEPTIDE_CHAIN_ELONGATION 1e-50
padj log2err ES NES size leadingEdge
<num> <num> <num> <num> <int> <list>
1: 8.13e-49 NA -0.9313928 -4.274983 80 6170, 47....
2: 8.13e-49 NA -0.9007982 -4.235589 88 6170, 47....
3: 8.13e-49 NA -0.8774321 -4.197048 99 6170, 47....
4: 8.13e-49 NA -0.8692742 -4.102250 92 6170, 47....
5: 8.13e-49 NA -0.8104961 -3.979237 114 6170, 47....
6: 8.13e-49 NA -0.9278116 -4.237599 78 6170, 47....
gseapermpheno_dysplasia[gseapermpheno_dysplasia$padj < 0.01, ]
Empty data.table (0 rows and 8 cols): pathway,pval,padj,ES,NES,nMoreExtreme...
gseapermgsets_dysplasia[gseapermgsets_dysplasia$padj < 0.01, ]
pathway
<char>
1: BIOCARTA_ACTINY_PATHWAY
2: BIOCARTA_CDC42RAC_PATHWAY
3: BIOCARTA_CHREBP2_PATHWAY
4: BIOCARTA_ETC_PATHWAY
5: BIOCARTA_MPR_PATHWAY
6: BIOCARTA_PROTEASOME_PATHWAY
7: BIOCARTA_RAB_PATHWAY
8: BIOCARTA_SALMONELLA_PATHWAY
9: BIOCARTA_SARS_PATHWAY
10: KEGG_ALZHEIMERS_DISEASE
11: KEGG_AMYOTROPHIC_LATERAL_SCLEROSIS_ALS
12: KEGG_ARGININE_AND_PROLINE_METABOLISM
13: KEGG_CARDIAC_MUSCLE_CONTRACTION
14: KEGG_CITRATE_CYCLE_TCA_CYCLE
15: KEGG_HUNTINGTONS_DISEASE
16: KEGG_N_GLYCAN_BIOSYNTHESIS
17: KEGG_OXIDATIVE_PHOSPHORYLATION
18: KEGG_PARKINSONS_DISEASE
19: KEGG_PATHOGENIC_ESCHERICHIA_COLI_INFECTION
20: KEGG_PROTEASOME
21: KEGG_PROTEIN_EXPORT
22: KEGG_PURINE_METABOLISM
23: KEGG_RIBOSOME
24: KEGG_SPLICEOSOME
25: KEGG_STEROID_BIOSYNTHESIS
26: REACTOME_ADP_SIGNALLING_THROUGH_P2Y_PURINOCEPTOR_1
27: REACTOME_APOPTOSIS
28: REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC
29: REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A
30: REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX
31: REACTOME_CELL_CYCLE_CHECKPOINTS
32: REACTOME_CELL_CYCLE_MITOTIC
33: REACTOME_CHOLESTEROL_BIOSYNTHESIS
34: REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_
35: REACTOME_CYTOSOLIC_TRNA_AMINOACYLATION
36: REACTOME_DARPP32_EVENTS
37: REACTOME_DNA_REPLICATION_PRE_INITIATION
38: REACTOME_DUAL_INCISION_REACTION_IN_TC_NER
39: REACTOME_ELECTRON_TRANSPORT_CHAIN
40: REACTOME_ELONGATION_AND_PROCESSING_OF_CAPPED_TRANSCRIPTS
41: REACTOME_FORMATION_AND_MATURATION_OF_MRNA_TRANSCRIPT
42: REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING
43: REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS
44: REACTOME_FORMATION_OF_THE_EARLY_ELONGATION_COMPLEX
45: REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX
46: REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC
47: REACTOME_G1_S_TRANSITION
48: REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION
49: REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT
50: REACTOME_HIV1_TRANSCRIPTION_ELONGATION
51: REACTOME_HIV1_TRANSCRIPTION_INITIATION
52: REACTOME_HIV_INFECTION
53: REACTOME_HIV_LIFE_CYCLE
54: REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS
55: REACTOME_INFLUENZA_LIFE_CYCLE
56: REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION
57: REACTOME_INSULIN_SYNTHESIS_AND_SECRETION
58: REACTOME_INTEGRATION_OF_ENERGY_METABOLISM
59: REACTOME_MEMBRANE_TRAFFICKING
60: REACTOME_METABOLISM_OF_AMINO_ACIDS
61: REACTOME_METABOLISM_OF_PROTEINS
62: REACTOME_MITOTIC_M_M_G1_PHASES
63: REACTOME_MRNA_SPLICING
64: REACTOME_MRNA_SPLICING_MINOR_PATHWAY
65: REACTOME_M_G1_TRANSITION
66: REACTOME_ORC1_REMOVAL_FROM_CHROMATIN
67: REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE
68: REACTOME_PEPTIDE_CHAIN_ELONGATION
69: REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC
70: REACTOME_PROCESSING_OF_CAPPED_INTRON_CONTAINING_PRE_MRNA
71: REACTOME_PYRUVATE_METABOLISM_AND_TCA_CYCLE
72: REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE
73: REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT
74: REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS
75: REACTOME_REGULATION_OF_INSULIN_SECRETION
76: REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE
77: REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1
78: REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21
79: REACTOME_SIGNALING_BY_WNT
80: REACTOME_STABILIZATION_OF_P53
81: REACTOME_SYNTHESIS_OF_DNA
82: REACTOME_S_PHASE
83: REACTOME_TRANSCRIPTION_COUPLED_NER
84: REACTOME_TRANSCRIPTION_OF_THE_HIV_GENOME
85: REACTOME_TRANSLATION
86: REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION
87: REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G
88: REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS
89: REACTOME_VIRAL_MRNA_TRANSLATION
pathway
pval padj log2err ES NES size leadingEdge
<num> <num> <num> <num> <num> <int> <list>
1: 3.336223e-04 3.562039e-03 0.4984931 -0.6691403 -2.081378 16 10094, 5....
2: 6.248539e-05 8.063591e-04 0.5384341 -0.7276557 -2.178454 14 10094, 3....
3: 2.043628e-04 2.373528e-03 0.5188481 -0.5196364 -1.979624 35 10971, 5....
4: 1.272674e-06 1.989777e-05 0.6435518 -0.8327598 -2.400357 12 54205, 6....
5: 7.092562e-04 6.851226e-03 0.4772708 -0.5515447 -1.959232 26 10094, 1....
6: 1.173923e-04 1.424476e-03 0.5384341 -0.6608645 -2.138371 19 5689, 61....
7: 4.558323e-07 7.563095e-06 0.6749629 -0.8874452 -2.348127 10 5868, 58....
8: 3.066344e-04 3.368834e-03 0.4984931 -0.7798890 -2.063540 10 10094, 5....
9: 6.315221e-04 6.185873e-03 0.4772708 -0.7981492 -1.954955 8 2091, 19....
10: 4.098740e-26 1.851264e-24 1.3267161 -0.6401202 -3.256527 139 506, 542....
11: 3.352730e-04 3.562039e-03 0.4984931 -0.4998801 -1.950632 39 54205, 5....
12: 1.601275e-04 1.886719e-03 0.5188481 -0.5216940 -2.024517 38 4942, 44....
13: 1.271430e-08 3.040214e-07 0.7477397 -0.6234644 -2.556679 46 1329, 73....
14: 1.454147e-05 2.149494e-04 0.5933255 -0.6251585 -2.224068 27 8802, 63....
15: 1.447183e-25 6.192421e-24 1.3110148 -0.6075394 -3.135642 157 506, 542....
16: 3.736896e-05 5.149316e-04 0.5573322 -0.5175880 -2.089971 44 6184, 88....
17: 5.011556e-35 3.134150e-33 1.5364590 -0.7326132 -3.596861 114 506, 471....
18: 4.738118e-33 2.751493e-31 1.4954793 -0.7322581 -3.592772 111 506, 542....
19: 3.373641e-04 3.562039e-03 0.4984931 -0.4883693 -1.951885 42 10971, 1....
20: 1.173811e-08 2.891845e-07 0.7477397 -0.6432409 -2.555135 41 5689, 56....
21: 1.457398e-04 1.742448e-03 0.5188481 -0.6288415 -2.125164 23 6731, 10....
22: 9.254324e-05 1.157502e-03 0.5384341 -0.3677763 -1.815030 120 5634, 54....
23: 1.000000e-50 8.130000e-49 NA -0.9313928 -4.274983 80 6170, 47....
24: 2.339411e-04 2.641585e-03 0.5188481 -0.3487875 -1.737921 124 6632, 66....
25: 5.073252e-04 5.092042e-03 0.4772708 -0.6809386 -2.038593 14 6309, 22....
26: 9.039515e-04 8.447271e-03 0.4772708 -0.6442683 -1.930484 15 2769, 55....
27: 2.148243e-05 3.064073e-04 0.5756103 -0.3868708 -1.900958 113 54205, 5....
28: 1.124826e-07 2.078372e-06 0.7049757 -0.5586646 -2.404575 56 7324, 56....
29: 1.032019e-07 1.951236e-06 0.7049757 -0.5520804 -2.420206 61 7324, 56....
30: 9.470563e-08 1.833230e-06 0.7049757 -0.5797406 -2.441094 50 5689, 57....
31: 8.910281e-09 2.336793e-07 0.7477397 -0.4892937 -2.330576 97 7324, 56....
32: 5.560917e-04 5.513446e-03 0.4772708 -0.2775420 -1.544275 268 6232, 73....
33: 9.957026e-05 1.226525e-03 0.5384341 -0.6654635 -2.153252 19 6309, 10....
34: 4.637012e-09 1.396256e-07 0.7614608 -0.5979572 -2.560842 55 5689, 57....
35: 5.052906e-04 5.092042e-03 0.4772708 -0.5992474 -2.025151 23 9255, 16....
36: 9.021179e-04 8.447271e-03 0.4772708 -0.5545511 -1.913469 24 5515, 10....
37: 1.793391e-08 3.940613e-07 0.7337620 -0.5346576 -2.391126 68 5689, 57....
38: 2.940978e-05 4.122439e-04 0.5756103 -0.6277078 -2.203301 25 5440, 54....
39: 4.267074e-27 2.040665e-25 1.3499257 -0.7782906 -3.505758 71 54205, 4....
40: 1.046205e-05 1.575119e-04 0.5933255 -0.3753481 -1.885063 130 6632, 66....
41: 8.571122e-06 1.314778e-04 0.5933255 -0.3650816 -1.876195 148 2958, 66....
42: 1.060214e-13 3.747626e-12 0.9545416 -0.9350571 -2.801803 15 506, 106....
43: 1.000000e-50 8.130000e-49 NA -0.9007982 -4.235589 88 6170, 47....
44: 4.751831e-04 4.890175e-03 0.4984931 -0.5513123 -1.984757 29 5440, 29....
45: 2.451075e-25 9.489161e-24 1.3030932 -0.8642115 -3.489603 44 6217, 21....
46: 3.835032e-05 5.196469e-04 0.5573322 -0.7199355 -2.157213 15 10694, 9....
47: 5.428916e-09 1.576325e-07 0.7614608 -0.5161450 -2.426935 88 5689, 57....
48: 3.582757e-29 1.941854e-27 1.4025887 -0.6559124 -3.335768 140 506, 542....
49: 1.000000e-50 8.130000e-49 NA -0.8774321 -4.197048 99 6170, 47....
50: 2.455509e-04 2.734697e-03 0.4984931 -0.5118734 -1.986407 38 6921, 54....
51: 4.794440e-05 6.286903e-04 0.5573322 -0.5524893 -2.106255 36 2958, 68....
52: 1.321673e-11 4.477169e-10 0.8753251 -0.4451817 -2.334587 176 2958, 54....
53: 2.115669e-04 2.422590e-03 0.5188481 -0.3728299 -1.789787 100 2958, 54....
54: 4.136405e-07 7.006037e-06 0.6749629 -0.4320191 -2.126484 116 5478, 49....
55: 1.957664e-44 1.446892e-42 1.7387813 -0.7574919 -3.793314 129 6170, 47....
56: 1.000000e-50 8.130000e-49 NA -0.8692742 -4.102250 92 6170, 47....
57: 1.000000e-50 8.130000e-49 NA -0.8104961 -3.979237 114 6170, 47....
58: 1.541289e-25 6.265340e-24 1.3110148 -0.5696555 -3.029977 191 506, 542....
59: 9.228149e-04 8.525551e-03 0.4772708 -0.3777424 -1.724934 76 8673, 11....
60: 5.950260e-09 1.668125e-07 0.7614608 -0.4594739 -2.295736 125 5689, 57....
61: 3.367637e-43 2.281574e-41 1.7147970 -0.6700502 -3.579434 194 6170, 47....
62: 1.612246e-07 2.912791e-06 0.6901325 -0.4180823 -2.128723 137 6232, 56....
63: 1.570493e-05 2.280020e-04 0.5756103 -0.4016226 -1.930960 104 6632, 66....
64: 4.832716e-08 1.007435e-06 0.7195128 -0.6304098 -2.492947 40 6632, 66....
65: 1.170229e-08 2.891845e-07 0.7477397 -0.5764672 -2.483526 57 5689, 57....
66: 2.234667e-07 3.949531e-06 0.6901325 -0.5415920 -2.374227 61 5689, 57....
67: 6.753795e-09 1.830279e-07 0.7614608 -0.6371275 -2.546433 42 5689, 57....
68: 1.000000e-50 8.130000e-49 NA -0.9278116 -4.237599 78 6170, 47....
69: 9.781723e-07 1.559322e-05 0.6435518 -0.7222901 -2.410892 21 7411, 10....
70: 9.217399e-05 1.157502e-03 0.5384341 -0.3543974 -1.801202 135 6632, 66....
71: 9.655577e-04 8.820207e-03 0.4772708 -0.4813570 -1.833794 35 8802, 63....
72: 2.402736e-07 4.156222e-06 0.6749629 -0.5187248 -2.324521 69 7324, 56....
73: 1.000000e-50 8.130000e-49 NA -0.9080956 -4.271741 87 6170, 47....
74: 1.000000e-50 8.130000e-49 NA -0.9158695 -4.226378 83 6170, 47....
75: 9.280765e-28 4.715789e-26 1.3651796 -0.6082146 -3.174860 171 506, 542....
76: 1.353900e-08 3.144917e-07 0.7477397 -0.6232034 -2.555609 46 5689, 57....
77: 7.437584e-08 1.474818e-06 0.7049757 -0.5958419 -2.473595 48 5689, 57....
78: 2.233452e-08 4.778411e-07 0.7337620 -0.5960809 -2.509898 50 5689, 57....
79: 4.460310e-10 1.450493e-08 0.8140358 -0.6088265 -2.644210 58 5689, 57....
80: 1.616141e-08 3.649784e-07 0.7337620 -0.6217982 -2.549847 46 5689, 57....
81: 9.146997e-07 1.487302e-05 0.6594444 -0.4751757 -2.190291 82 2237, 56....
82: 7.351977e-08 1.474818e-06 0.7049757 -0.4790759 -2.267473 93 2237, 56....
83: 4.635933e-04 4.832068e-03 0.4984931 -0.4834178 -1.911668 40 5440, 51....
84: 4.023557e-05 5.362544e-04 0.5573322 -0.4756486 -2.047261 56 2958, 68....
85: 1.000000e-50 8.130000e-49 NA -0.8250959 -4.065058 112 6170, 47....
86: 3.759356e-24 1.389253e-22 1.2709130 -0.8207037 -3.471127 51 6217, 21....
87: 1.215655e-09 3.801258e-08 0.7881868 -0.6445795 -2.655455 47 6921, 56....
88: 7.163028e-04 6.851226e-03 0.4772708 -0.6992133 -2.015421 12 5440, 29....
89: 1.000000e-50 8.130000e-49 NA -0.9222092 -4.212012 78 6170, 47....
pval padj log2err ES NES size leadingEdge
significant_pathways_gsets_dysplasia <- gseapermgsets_dysplasia[gseapermgsets_dysplasia$padj < 0.01, ]
significant_pathways_gsets_ordered_dysplasia <- significant_pathways_gsets_dysplasia[order(significant_pathways_gsets_dysplasia$padj), ]
head(significant_pathways_gsets_ordered_dysplasia, n = 10)
pathway pval
<char> <num>
1: KEGG_RIBOSOME 1e-50
2: REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 1e-50
3: REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT 1e-50
4: REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 1e-50
5: REACTOME_INSULIN_SYNTHESIS_AND_SECRETION 1e-50
6: REACTOME_PEPTIDE_CHAIN_ELONGATION 1e-50
7: REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT 1e-50
8: REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS 1e-50
9: REACTOME_TRANSLATION 1e-50
10: REACTOME_VIRAL_MRNA_TRANSLATION 1e-50
padj log2err ES NES size leadingEdge
<num> <num> <num> <num> <int> <list>
1: 8.13e-49 NA -0.9313928 -4.274983 80 6170, 47....
2: 8.13e-49 NA -0.9007982 -4.235589 88 6170, 47....
3: 8.13e-49 NA -0.8774321 -4.197048 99 6170, 47....
4: 8.13e-49 NA -0.8692742 -4.102250 92 6170, 47....
5: 8.13e-49 NA -0.8104961 -3.979237 114 6170, 47....
6: 8.13e-49 NA -0.9278116 -4.237599 78 6170, 47....
7: 8.13e-49 NA -0.9080956 -4.271741 87 6170, 47....
8: 8.13e-49 NA -0.9158695 -4.226378 83 6170, 47....
9: 8.13e-49 NA -0.8250959 -4.065058 112 6170, 47....
10: 8.13e-49 NA -0.9222092 -4.212012 78 6170, 47....
Create histograms of p-values and generate an enrichment plot.
par(mfrow=c(1, 2))
hist(gseapermpheno_dysplasia$pval, xlab="P-value", main="Permuting phenotypes", las=1)
hist(gseapermgsets_dysplasia$pval, xlab="P-value", main="Permuting gene sets", las=1)
plotEnrichment(gsets_dysplasia$CONTROL, stats_dysplasia)
plotEnrichment(gsets_dysplasia$TUMOR, stats_dysplasia)
Identify significant pathways.
DEpwys_dysplasia <- significant_pathways_gsets_ordered_dysplasia$pathway
plotGseaTable(gsets_dysplasia[DEpwys_dysplasia], stats_dysplasia, gseapermgsets_dysplasia, gseaParam=0.5)
We generate an HTML page so we can visualize more easily.
library(kableExtra)
gsea_tab <- kable(significant_pathways_gsets_ordered_dysplasia, "html", caption="GSEA Results Dysplasia")
gsea_tab <- kable_styling(gsea_tab, bootstrap_options=c("stripped", "hover", "responsive"), fixed_thead=TRUE)
save_kable(gsea_tab, file="./results/gsea_results_dysplasia.html", self_contained=TRUE)
See the HTML page by clicking on this link.
Additional validation using Camera.
library(limma)
gsetidx_dysplasia <- ids2indices(geneIds(gsc_dysplasia), rownames(se.filt))
camerares_dysplasia <- camera(y=assays(se.filt)$logCPM, index=gsetidx_dysplasia, design=mod, contrast=2)
head(camerares_dysplasia, 5)
NGenes Direction
KEGG_RIBOSOME 80 Down
REACTOME_PEPTIDE_CHAIN_ELONGATION 78 Down
REACTOME_VIRAL_MRNA_TRANSLATION 78 Down
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 88 Down
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 92 Down
PValue
KEGG_RIBOSOME 2.850098e-41
REACTOME_PEPTIDE_CHAIN_ELONGATION 3.820858e-41
REACTOME_VIRAL_MRNA_TRANSLATION 1.309456e-40
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 1.886464e-39
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 2.731443e-38
FDR
KEGG_RIBOSOME 1.591387e-38
REACTOME_PEPTIDE_CHAIN_ELONGATION 1.591387e-38
REACTOME_VIRAL_MRNA_TRANSLATION 3.635921e-38
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 3.928561e-37
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 4.550585e-36
Filter significant pathways from Camera.
camerares_dysplasia[camerares_dysplasia$FDR < 0.01, ]
NGenes
KEGG_RIBOSOME 80
REACTOME_PEPTIDE_CHAIN_ELONGATION 78
REACTOME_VIRAL_MRNA_TRANSLATION 78
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 88
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 92
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS 83
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT 99
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION 114
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT 87
REACTOME_TRANSLATION 112
REACTOME_INFLUENZA_LIFE_CYCLE 129
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX 44
REACTOME_DIABETES_PATHWAYS 312
REACTOME_METABOLISM_OF_PROTEINS 194
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION 51
KEGG_PARKINSONS_DISEASE 111
KEGG_OXIDATIVE_PHOSPHORYLATION 114
REACTOME_ELECTRON_TRANSPORT_CHAIN 71
KEGG_ALZHEIMERS_DISEASE 139
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION 140
KEGG_HUNTINGTONS_DISEASE 157
REACTOME_REGULATION_OF_INSULIN_SECRETION 171
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G 47
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING 15
KEGG_PROTEASOME 41
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE 42
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE 46
REACTOME_STABILIZATION_OF_P53 46
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_ 55
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1 48
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM 191
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 50
REACTOME_SIGNALING_BY_WNT 58
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC 21
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A 61
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX 50
REACTOME_GENE_EXPRESSION 397
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC 56
REACTOME_M_G1_TRANSITION 57
REACTOME_G1_S_TRANSITION 88
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE 69
KEGG_PROTEIN_EXPORT 23
BIOCARTA_ETC_PATHWAY 12
REACTOME_DNA_REPLICATION_PRE_INITIATION 68
BIOCARTA_RAB_PATHWAY 10
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN 61
REACTOME_METABOLISM_OF_AMINO_ACIDS 125
REACTOME_CELL_CYCLE_CHECKPOINTS 97
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC 15
REACTOME_S_PHASE 93
REACTOME_HIV_INFECTION 176
REACTOME_SYNTHESIS_OF_DNA 82
BIOCARTA_PROTEASOME_PATHWAY 19
KEGG_CARDIAC_MUSCLE_CONTRACTION 46
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS 116
REACTOME_MRNA_SPLICING_MINOR_PATHWAY 40
BIOCARTA_SALMONELLA_PATHWAY 10
BIOCARTA_CDC42RAC_PATHWAY 14
REACTOME_CHOLESTEROL_BIOSYNTHESIS 19
REACTOME_MITOTIC_M_M_G1_PHASES 137
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS 12
REACTOME_HIV1_TRANSCRIPTION_INITIATION 36
KEGG_CITRATE_CYCLE_TCA_CYCLE 27
Direction
KEGG_RIBOSOME Down
REACTOME_PEPTIDE_CHAIN_ELONGATION Down
REACTOME_VIRAL_MRNA_TRANSLATION Down
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS Down
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION Down
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS Down
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT Down
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION Down
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT Down
REACTOME_TRANSLATION Down
REACTOME_INFLUENZA_LIFE_CYCLE Down
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX Down
REACTOME_DIABETES_PATHWAYS Down
REACTOME_METABOLISM_OF_PROTEINS Down
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION Down
KEGG_PARKINSONS_DISEASE Down
KEGG_OXIDATIVE_PHOSPHORYLATION Down
REACTOME_ELECTRON_TRANSPORT_CHAIN Down
KEGG_ALZHEIMERS_DISEASE Down
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION Down
KEGG_HUNTINGTONS_DISEASE Down
REACTOME_REGULATION_OF_INSULIN_SECRETION Down
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G Down
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING Down
KEGG_PROTEASOME Down
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE Down
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE Down
REACTOME_STABILIZATION_OF_P53 Down
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_ Down
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1 Down
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM Down
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 Down
REACTOME_SIGNALING_BY_WNT Down
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC Down
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A Down
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX Down
REACTOME_GENE_EXPRESSION Down
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC Down
REACTOME_M_G1_TRANSITION Down
REACTOME_G1_S_TRANSITION Down
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE Down
KEGG_PROTEIN_EXPORT Down
BIOCARTA_ETC_PATHWAY Down
REACTOME_DNA_REPLICATION_PRE_INITIATION Down
BIOCARTA_RAB_PATHWAY Down
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN Down
REACTOME_METABOLISM_OF_AMINO_ACIDS Down
REACTOME_CELL_CYCLE_CHECKPOINTS Down
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC Down
REACTOME_S_PHASE Down
REACTOME_HIV_INFECTION Down
REACTOME_SYNTHESIS_OF_DNA Down
BIOCARTA_PROTEASOME_PATHWAY Down
KEGG_CARDIAC_MUSCLE_CONTRACTION Down
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS Down
REACTOME_MRNA_SPLICING_MINOR_PATHWAY Down
BIOCARTA_SALMONELLA_PATHWAY Down
BIOCARTA_CDC42RAC_PATHWAY Down
REACTOME_CHOLESTEROL_BIOSYNTHESIS Down
REACTOME_MITOTIC_M_M_G1_PHASES Down
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS Down
REACTOME_HIV1_TRANSCRIPTION_INITIATION Down
KEGG_CITRATE_CYCLE_TCA_CYCLE Down
PValue
KEGG_RIBOSOME 2.850098e-41
REACTOME_PEPTIDE_CHAIN_ELONGATION 3.820858e-41
REACTOME_VIRAL_MRNA_TRANSLATION 1.309456e-40
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 1.886464e-39
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 2.731443e-38
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS 1.145772e-35
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT 1.225223e-35
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION 1.019429e-33
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT 2.283164e-33
REACTOME_TRANSLATION 1.171285e-32
REACTOME_INFLUENZA_LIFE_CYCLE 3.033430e-22
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX 5.392902e-22
REACTOME_DIABETES_PATHWAYS 8.692926e-22
REACTOME_METABOLISM_OF_PROTEINS 5.349591e-20
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION 9.395454e-20
KEGG_PARKINSONS_DISEASE 1.839192e-15
KEGG_OXIDATIVE_PHOSPHORYLATION 3.491358e-15
REACTOME_ELECTRON_TRANSPORT_CHAIN 5.005995e-15
KEGG_ALZHEIMERS_DISEASE 3.306168e-12
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION 4.568293e-11
KEGG_HUNTINGTONS_DISEASE 3.248647e-10
REACTOME_REGULATION_OF_INSULIN_SECRETION 1.611900e-09
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G 2.700881e-09
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING 1.052427e-08
KEGG_PROTEASOME 1.058222e-08
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE 3.144120e-08
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE 3.740179e-08
REACTOME_STABILIZATION_OF_P53 6.522039e-08
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_ 7.748487e-08
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1 8.120236e-08
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM 9.625664e-08
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 1.853028e-07
REACTOME_SIGNALING_BY_WNT 1.932560e-07
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC 3.724800e-07
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A 4.225662e-07
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX 5.652607e-07
REACTOME_GENE_EXPRESSION 6.425514e-07
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC 7.186981e-07
REACTOME_M_G1_TRANSITION 1.059953e-06
REACTOME_G1_S_TRANSITION 1.682678e-06
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE 1.857432e-06
KEGG_PROTEIN_EXPORT 2.806326e-06
BIOCARTA_ETC_PATHWAY 3.925179e-06
REACTOME_DNA_REPLICATION_PRE_INITIATION 4.241129e-06
BIOCARTA_RAB_PATHWAY 7.089697e-06
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN 7.910600e-06
REACTOME_METABOLISM_OF_AMINO_ACIDS 1.757114e-05
REACTOME_CELL_CYCLE_CHECKPOINTS 1.821139e-05
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC 2.301368e-05
REACTOME_S_PHASE 2.778510e-05
REACTOME_HIV_INFECTION 3.533418e-05
REACTOME_SYNTHESIS_OF_DNA 5.736754e-05
BIOCARTA_PROTEASOME_PATHWAY 1.466427e-04
KEGG_CARDIAC_MUSCLE_CONTRACTION 1.928235e-04
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS 2.272538e-04
REACTOME_MRNA_SPLICING_MINOR_PATHWAY 2.721551e-04
BIOCARTA_SALMONELLA_PATHWAY 2.757882e-04
BIOCARTA_CDC42RAC_PATHWAY 3.143116e-04
REACTOME_CHOLESTEROL_BIOSYNTHESIS 4.647822e-04
REACTOME_MITOTIC_M_M_G1_PHASES 4.659876e-04
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS 4.930346e-04
REACTOME_HIV1_TRANSCRIPTION_INITIATION 4.994264e-04
KEGG_CITRATE_CYCLE_TCA_CYCLE 5.506234e-04
FDR
KEGG_RIBOSOME 1.591387e-38
REACTOME_PEPTIDE_CHAIN_ELONGATION 1.591387e-38
REACTOME_VIRAL_MRNA_TRANSLATION 3.635921e-38
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 3.928561e-37
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 4.550585e-36
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS 1.458015e-33
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT 1.458015e-33
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION 1.061480e-31
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT 2.113196e-31
REACTOME_TRANSLATION 9.756803e-31
REACTOME_INFLUENZA_LIFE_CYCLE 2.297134e-20
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX 3.743573e-20
REACTOME_DIABETES_PATHWAYS 5.570160e-20
REACTOME_METABOLISM_OF_PROTEINS 3.183007e-18
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION 5.217609e-18
KEGG_PARKINSONS_DISEASE 9.575296e-14
KEGG_OXIDATIVE_PHOSPHORYLATION 1.710766e-13
REACTOME_ELECTRON_TRANSPORT_CHAIN 2.316663e-13
KEGG_ALZHEIMERS_DISEASE 1.449494e-10
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION 1.902694e-09
KEGG_HUNTINGTONS_DISEASE 1.288630e-08
REACTOME_REGULATION_OF_INSULIN_SECRETION 6.103241e-08
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G 9.781886e-08
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING 3.525994e-07
KEGG_PROTEASOME 3.525994e-07
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE 1.007328e-06
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE 1.153915e-06
REACTOME_STABILIZATION_OF_P53 1.940307e-06
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_ 2.225686e-06
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1 2.254719e-06
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM 2.586509e-06
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 4.823664e-06
REACTOME_SIGNALING_BY_WNT 4.878251e-06
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC 9.125760e-06
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A 1.005707e-05
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX 1.307950e-05
REACTOME_GENE_EXPRESSION 1.446609e-05
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC 1.575462e-05
REACTOME_M_G1_TRANSITION 2.263951e-05
REACTOME_G1_S_TRANSITION 3.504176e-05
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE 3.773759e-05
KEGG_PROTEIN_EXPORT 5.565880e-05
BIOCARTA_ETC_PATHWAY 7.603893e-05
REACTOME_DNA_REPLICATION_PRE_INITIATION 8.029228e-05
BIOCARTA_RAB_PATHWAY 1.312382e-04
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN 1.432507e-04
REACTOME_METABOLISM_OF_AMINO_ACIDS 3.114204e-04
REACTOME_CELL_CYCLE_CHECKPOINTS 3.160435e-04
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC 3.912325e-04
REACTOME_S_PHASE 4.628998e-04
REACTOME_HIV_INFECTION 5.771249e-04
REACTOME_SYNTHESIS_OF_DNA 9.189839e-04
BIOCARTA_PROTEASOME_PATHWAY 2.304781e-03
KEGG_CARDIAC_MUSCLE_CONTRACTION 2.974481e-03
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS 3.441862e-03
REACTOME_MRNA_SPLICING_MINOR_PATHWAY 4.030378e-03
BIOCARTA_SALMONELLA_PATHWAY 4.030378e-03
BIOCARTA_CDC42RAC_PATHWAY 4.514165e-03
REACTOME_CHOLESTEROL_BIOSYNTHESIS 6.469461e-03
REACTOME_MITOTIC_M_M_G1_PHASES 6.469461e-03
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS 6.710035e-03
REACTOME_HIV1_TRANSCRIPTION_INITIATION 6.710035e-03
KEGG_CITRATE_CYCLE_TCA_CYCLE 7.280464e-03
Visualise significant pathways from Camera for Dysplasia.
hist(camerares_dysplasia$PValue, xlab="P-value", main="Camera", las=1)
We ended up following two different pipelines and therefore reached two sets of differentially expressed genes and pathways. This proved to be fruitful seeing as both current vs. former smokers and dysplasia vs. no dysplasia have statistically significant differencial expression. The results of the differential expression were significant enough to perform functional analysis on and find out how the genome changes between current smokers vs former smokers, and how does it affect the state of their ariway tissue (dysplasia).
We decided on this aim for our analysis because of the results we got from doing Experimental Data Analysis on the dataset. We looked at all of the phenotypical and technical variables to find something that may be worth of interest or that might output statistical significant results. What we found during Experimental Dessign and batch identification is that our samples clustered together (with some exceptions) in two groups. After some trial and error, we found the reason for this clustering is the divide between the groups of samples that are current smokers vs. former smokers. We went ahead and decided to explore this path.
A second analysis seemed pertinent, on a different variable to be able to compare the two. Dysplasia vs. no dysplasia was an appropriate group, especially since it describes the state of damage of the tissue, which should be related to wether or not someone smokes.
As we suspected during EDA and Experimental Design, the results from our Differential Expression Analysis in Current smokers vs. Former smokers were of high statistical significance. We found a very high number of DE genes, some of the top ones being up until 26 times more upregulated in one group. Our distribution of p-values shows how most of them are very close to 0, and the volcano plot also helps confirm there is significant Differential Gene Expression between samples belonging to former smokers vs. samples belonging to current smokers.
In our analysis, we aimed to explore the effects of smoking on cellular functions using Gene Set Enrichment Analysis (GSEA) and Gene Ontology (GO) analysis. We conducted these analyses on samples with dysplasia versus no dysplasia and current smokers versus former smokers.
Both GSEA and GO analysis in our sets of DE genes (dysplasia vs no dysplasia and current vs former smoker) revealed that smoking significantly impacts various cellular pathways, particularly those related to ribosomal function, protein synthesis, mitochondrial function, and metabolic processes (REACTOME_TRANSLATION, REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT and REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS). This illustrates that chronic exposure to cigarette smoke can lead to oxidative stress, damaging cellular components and disrupting normal protein synthesis and function.
Smoking affects the expression and function of ribosomal proteins, which are critical for protein biogenesis (KEGG_RIBOSOME). Our findings align with the study by Park et al. (2021), which also identified a detrimental effect of smoking on ribosomal function and protein synthesis. Exposure to electronic cigarette smoke reduces ribosomal protein gene expression, impairing protein synthesis in primary human airway epithelial cells. The reduced expression of ribosomal proteins and subsequent impairment in protein synthesis could lead to various cellular dysfunctions and contribute to the pathogenesis of various smoking-related diseases, including cancer.
Further findings indicate that smoking disrupts cytoplasmic translation and peptide chain elongation (REACTOME_PEPTIDE_CHAIN_ELONGATION). The study by Merzah et al. (2023) supports these observations, revealing differentially expressed genes (DEGs) related to immune response and inflammation pathways in smokers. Smoking leads to significant downregulation of genes involved in oxygen binding and activity, emphasizing the role of oxidative stress in disrupting cellular functions, contributing to disease development.
Additionally, our analyses identified pathways related to mitochondrial electron transport and ATP synthesis (KEGG_OXIDATIVE_PHOSPHORYLATION). Smoking introduces numerous toxins that induce oxidative stress, resulting in the production of reactive oxygen species (ROS), which can damage cellular components, including mitochondria. This oxidative stress impairs mitochondrial function, leading to a cascade of detrimental effects on cellular health. The study by Kanithi et al. (2022) highlighted the impact of cigarette smoke and e-cigarette vapor on mitochondrial function by increasing oxidative stress, disrupting mitochondrial morphology, and impairing critical processes such as fusion, fission, and mitophagy.
Smoking-induced oxidative stress damages the electron transport chain, leading to reduced ATP production and impaired cellular energy metabolism. The resulting energy deficit can hinder cellular processes, contributing to the development of diseases such as cancer and cardiovascular conditions. The study by Tan et al. (2008) concluded that cigarette smoke exposure affects mtDNA in buccal cells of smokers. Additionally, Fetterman, Sammy, and Ballinger (2017) reported that active and passive cigarette smoke decreases mitochondrial respiration and membrane potential, leading to decreased ATP content and increased oxidant production in various tissues, exacerbating cellular dysfunction and contributing to smoking-related diseases.
Pathways like REACTOME_INSULIN_SYNTHESIS_AND_SECRETION and REACTOME_GLUCOS_REGULATION_OF_INSULIN_SECRETION point to potential disruptions in insulin signaling. Smoking has been associated with insulin resistance (Facchini et al. (1992)) and an increased risk of developing type 2 diabetes (Will et al. (2001)) This could be due to inflammation and oxidative stress induced by smoking, affecting pancreatic beta cells’ ability to produce insulin and cells’ sensitivity to insulin.
GO terms such as tryptophan transport, valine transport, and thyroid hormone transport highlight changes in the transport of essential molecules. Smoking can alter the expression of various transporters on cell membranes, impacting nutrient uptake and hormone distribution. Babić Leko et al. (2021) concluded that smoking leads to a decrease in TSH levels and an increase in T3 and T4 levels. The mechanism through which cigarette smoking affects TSH and thyroid hormone levels remains unclear. Changes in amino acid transport can affect protein synthesis and cellular repair mechanisms.
The GO terms daunorubicin metabolic process and doxorubicin metabolic process suggest differences in how former and current smokers metabolize specific substances. These findings align with research showing that smoking induces certain cytochrome P450 enzymes involved in drug metabolism, leading to altered pharmacokinetics and potentially affecting the efficacy and toxicity of various medications (Zhao et al. (2021)).
Collectively, these findings underscore the broad impact of smoking on critical biological functions, affecting everything from protein synthesis to metabolic and transport processes. The impairment of these pathways highlights the profound and multifaceted consequences of smoking on cellular health, contributing to the development of various smoking-related diseases.
Differential Expression and Functional Analysis allowed us to identify differentially expressed genes and pathways between current smokers and former smokers, as well as samples with damaged lung tissue (dysplasia) vs. healthy lung tissue (no dysplasia).
We wanted to gain some insight into the effects of smoking on the airway, and whether or not there is significance in smokers vs. people that quit smoking. The fact that there is very significant differential expression between these groups, and that there are pathways that are differentially expressed, tells us that there are significant changes to the genome once the smoking stops. Comparing it to dysplasia status, and seeing so many pathways that are differentially expressed in one classification as well as the other, could indicate there’s a correlation between Histology degree and the smoking habit.
In our analysis of the effects of smoking on cellular functions using Differential Expression (DE), Gene Ontology (GO), and Gene Set Enrichment Analysis (GSEA) on RNA-seq data, we discovered significant molecular alterations between current and former smokers, and individuals with and without dysplasia.
Even though our process did yield significant results, we think it’s still superficial. We could add to this analysis by also sequencing genome from samples that never smoked, or try and go further into the changes the genome goes through once one stops smoking. We think this is a very relevant topic seeing as smoking is still really prevalent, and lung cancer is one of the leading causes of death in the United States.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods
[8] base
other attached packages:
[1] dplyr_1.1.4 msigdbr_7.5.1
[3] GSVAdata_1.40.0 hgu95a.db_3.13.0
[5] GSEABase_1.66.0 fgsea_1.30.0
[7] GO.db_3.19.1 org.Hs.eg.db_3.19.1
[9] GOstats_2.70.0 graph_1.82.0
[11] Category_2.70.0 Matrix_1.7-0
[13] RColorBrewer_1.1-3 ggplot2_3.5.1
[15] geneplotter_1.82.0 annotate_1.82.0
[17] XML_3.99-0.17 AnnotationDbi_1.66.0
[19] lattice_0.22-6 edgeR_4.2.1
[21] limma_3.60.3 SummarizedExperiment_1.34.0
[23] Biobase_2.64.0 GenomicRanges_1.56.1
[25] GenomeInfoDb_1.40.1 IRanges_2.38.1
[27] S4Vectors_0.42.1 BiocGenerics_0.50.0
[29] MatrixGenerics_1.16.0 matrixStats_1.3.0
[31] here_1.0.1 kableExtra_1.4.0
[33] knitr_1.48 BiocStyle_2.32.1
loaded via a namespace (and not attached):
[1] bitops_1.0-7 DBI_1.2.3 RBGL_1.80.0
[4] rlang_1.1.4 magrittr_2.0.3 compiler_4.4.1
[7] RSQLite_2.3.7 png_0.1-8 systemfonts_1.1.0
[10] vctrs_0.6.5 stringr_1.5.1 pkgconfig_2.0.3
[13] crayon_1.5.3 fastmap_1.2.0 magick_2.8.4
[16] XVector_0.44.0 labeling_0.4.3 utf8_1.2.4
[19] rmarkdown_2.27 UCSC.utils_1.0.0 tinytex_0.51
[22] bit_4.0.5 xfun_0.45 zlibbioc_1.50.0
[25] cachem_1.1.0 jsonlite_1.8.8 blob_1.2.4
[28] highr_0.11 DelayedArray_0.30.1 BiocParallel_1.38.0
[31] parallel_4.4.1 R6_2.5.1 bslib_0.7.0
[34] stringi_1.8.4 genefilter_1.86.0 jquerylib_0.1.4
[37] Rcpp_1.0.12 bookdown_0.40 splines_4.4.1
[40] tidyselect_1.2.1 rstudioapi_0.16.0 abind_1.4-5
[43] yaml_2.3.9 codetools_0.2-20 tibble_3.2.1
[46] withr_3.0.0 KEGGREST_1.44.1 evaluate_0.24.0
[49] AnnotationForge_1.46.0 survival_3.7-0 xml2_1.3.6
[52] Biostrings_2.72.1 pillar_1.9.0 BiocManager_1.30.23
[55] KernSmooth_2.23-24 renv_1.0.7 generics_0.1.3
[58] RCurl_1.98-1.16 rprojroot_2.0.4 munsell_0.5.1
[61] scales_1.3.0 xtable_1.8-4 glue_1.7.0
[64] tools_4.4.1 data.table_1.15.4 locfit_1.5-9.10
[67] babelgene_22.9 fastmatch_1.1-4 cowplot_1.1.3
[70] grid_4.4.1 colorspace_2.1-0 GenomeInfoDbData_1.2.12
[73] cli_3.6.3 fansi_1.0.6 S4Arrays_1.4.1
[76] viridisLite_0.4.2 svglite_2.1.3 Rgraphviz_2.48.0
[79] gtable_0.3.5 sass_0.4.9 digest_0.6.36
[82] SparseArray_1.4.8 farver_2.1.2 memoise_2.0.1
[85] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7
[88] statmod_1.5.0 bit64_4.0.5